Preface


Why Randomness? 


Why should computer scientists study and use randomness? Computers appear to
behave far too unpredictably as it is! Adding randomness would seemingly be a
disadvantage, adding further complications to the already challenging task of
efficiently utilizing computers.

Science has learned in the last century to accept randomness as an essential
component in modeling and analyzing nature. In physics, for example, Newton’s laws led
people to believe that the universe was a deterministic place; given a big enough
calculator and the appropriate initial conditions, one could determine the location of
planets years from now. The development of quantum theory suggests a rather different
view; the universe still behaves according to laws, but the backbone of these laws is
probabilistic. “God does not play dice with the universe” was Einstein’s anecdotal
objection to modern quantum mechanics. Nevertheless, the prevailing theory today for
subparticle physics is based on random behavior and statistical laws, and randomness
plays a significant role in almost every other field of science ranging from genetics
and evolution in biology to modeling price fluctuations in a free-market economy.

Computer science is no exception. From the highly theoretical notion of
probabilistic theorem proving to the very practical design of PC Ethernet cards, randomness
and probabilistic methods play a key role in modern computer science. The last two
decades have witnessed a tremendous growth in the use of probability theory in
computing. Increasingly more advanced and sophisticated probabilistic techniques have
been developed for use within broader and more challenging computer science
applications. In this book, we study the fundamental ways in which randomness comes
to bear on computer science: randomized algorithms and the probabilistic analysis of
algorithms.

Randomized algorithms: Randomized algorithms are algorithms that make random
choices during their execution. In practice, a randomized program would use values
generated by a random number generator to decide the next step at several branches
of its execution. For example, the protocol implemented in an Ethernet card uses
random numbers to decide when it next tries to access the shared Ethernet communication
medium. The randomness is useful for breaking symmetry, preventing different cards
from repeatedly accessing the medium at the same time. Other commonly used
applications of randomized algorithms include Monte Carlo simulations and primality
testing in cryptography. In these and many other important applications, randomized
algorithms are significantly more efficient than the best known deterministic solutions.
Furthermore, in most cases the randomized algorithms are also simpler and easier to
program.

These gains come at a price; the answer may have some probability of being
incorrect, or the efficiency is guaranteed only with some probability. Although it may seem
unusual to design an algorithm that may be incorrect, if the probability of error is
sufficiently small then the improvement in speed or memory requirements may well be
worthwhile.

Probabilistic analysis of algorithms: Complexity theory tries to classify
computation problems according to their computational complexity, in particular
distinguishing between easy and hard problems. For example, complexity theory shows that the
Traveling Salesman problem is NP-hard. It is therefore very unlikely that there is an
algorithm that can solve any instance of the Traveling Salesman problem in time that
is subexponential in the number of cities. An embarrassing phenomenon for the
classical worst-case complexity theory is that the problems it classifies as hard to compute
are often easy to solve in practice. Probabilistic analysis gives a theoretical explanation
for this phenomenon. Although these problems may be hard to solve on some set of
pathological inputs, on most inputs (in particular, those that occur in real-life
applications) the problem is actually easy to solve. More precisely, if we think of the input as
being randomly selected according to some probability distribution on the collection of
all possible inputs, we are very likely to obtain a problem instance that is easy to solve,
and instances that are hard to solve appear with relatively small probability.
Probabilistic analysis of algorithms is the method of studying how algorithms perform when the
input is taken from a well-defined probabilistic space. As we will see, even NP-hard
problems might have algorithms that are extremely efficient on almost all inputs.


The Book 

This textbook is designed to accompany one- or two-semester courses for advanced
undergraduate or beginning graduate students in computer science and applied
mathematics. The study of randomized and probabilistic techniques in most leading
universities has moved from being the subject of an advanced graduate seminar meant
for theoreticians to being a regular course geared generally to advanced undergraduate
and beginning graduate students. There are a number of excellent advanced,
research-oriented books on this subject, but there is a clear need for an introductory textbook.
We hope that our book satisfies this need.

The textbook has developed from courses on probabilistic methods in computer
science taught at Brown (CS 155) and Harvard (CS 223) in recent years. The emphasis
in these courses and in this textbook is on the probabilistic techniques and paradigms,
not on particular applications. Each chapter of the book is devoted to one such method
or technique. Techniques are clarified through examples based on analyzing
randomized algorithms or developing probabilistic analysis of algorithms on random inputs.
Many of these examples are derived from problems in networking, reflecting a
prominent trend in the networking field (and the taste of the authors).

The book contains fourteen chapters. We may view the book as being divided into
two parts, where the first part (Chapters 1-7) comprises what we believe is core
material. The book assumes only a basic familiarity with probability theory, equivalent to
what is covered in a standard course on discrete mathematics for computer scientists.
Chapters 1-3 review this elementary probability theory while introducing some
interesting applications. Topics covered include random sampling, expectation, Markov’s
inequality, variance, and Chebyshev’s inequality. If the class has sufficient background
in probability, then these chapters can be taught quickly. We do not suggest skipping
them, however, because they introduce the concepts of randomized algorithms and
probabilistic analysis of algorithms and also contain several examples that are used
throughout the text.

Chapters 4-7 cover more advanced topics, including Chernoff bounds,
balls-and-bins models, the probabilistic method, and Markov chains. The material in these
chapters is more challenging than in the initial chapters. Sections that are particularly
challenging (and hence that the instructor may want to consider skipping) are marked with
an asterisk. The core material in the first seven chapters may constitute the bulk of a
quarter- or semester-long course, depending on the pace.

The second part of the book (Chapters 8-14) covers additional advanced material
that can be used either to fill out the basic course as necessary or for a more advanced
second course. These chapters are largely self-contained, so the instructor can choose
the topics best suited to the class. The chapters on continuous probability and
entropy are perhaps the most appropriate for incorporating into the basic course. Our
introduction to continuous probability (Chapter 8) focuses on uniform and exponential
distributions, including examples from queueing theory. Our examination of entropy
(Chapter 9) shows how randomness can be measured and how entropy arises naturally
in the context of randomness extraction, compression, and coding.

Chapters 10 and 11 cover the Monte Carlo method and coupling, respectively; these
chapters are closely related and are best taught together. Chapter 12, on martingales,
covers important issues on dealing with dependent random variables, a theme that
continues in a different vein in Chapter 13’s development of pairwise independence and
derandomization. Finally, the chapter on balanced allocations (Chapter 14) covers a
topic close to the authors’ hearts and ties in nicely with Chapter 5’s analysis of
balls-and-bins problems.

The order of the subjects, especially in the first part of the book, corresponds to
their relative importance in the algorithmic literature. Thus, for example, the study
of Chernoff bounds precedes more fundamental probability concepts such as Markov
chains. However, instructors may choose to teach the chapters in a different order. A
course with more emphasis on general stochastic processes, for example, may teach
Markov chains (Chapter 7) immediately after Chapters 1-3, following with the chapter
on balls, bins, and random graphs (Chapter 5, omitting the Hamiltonian cycle
example). Chapter 6 on the probabilistic method could then be skipped, following instead
with continuous probability and the Poisson process (Chapter 8). The material from
Chapter 4 on Chernoff bounds, however, is needed for most of the remaining material.

Most of the exercises in the book are theoretical, but we have included some
programming exercises - including two more extensive exploratory assignments that require
some programming. We have found that occasional programming exercises are often
helpful in reinforcing the book’s ideas and in adding some variety to the course.

We have decided to restrict the material in this book to methods and techniques based
on rigorous mathematical analysis; with few exceptions, all claims in this book are
followed by full proofs. Obviously, many extremely useful probabilistic methods do not
fall within this strict category. For example, in the important area of Monte Carlo
methods, most practical solutions are heuristics that have been demonstrated to be effective
and efficient by experimental evaluation rather than by rigorous mathematical
analysis. We have taken the view that, in order to best apply and understand the strengths
and weaknesses of heuristic methods, a firm grasp of underlying probability theory and
rigorous techniques - as we present in this book - is necessary. We hope that students
will appreciate this point of view by the end of the course.


CHAPTER ONE


Events and Probability 


This chapter introduces the notion of randomized algorithms and reviews some basic 
concepts of probability theory in the context of analyzing the performance of simple 
randomized algorithms for verifying algebraic identities and finding a minimum cut-set 
in a graph. 


1.1. Application: Verifying Polynomial Identities 


Computers can sometimes make mistakes, due for example to incorrect programming
or hardware failure. It would be useful to have simple ways to double-check the results
of computations. For some problems, we can use randomness to efficiently verify the
correctness of an output.

Suppose we have a program that multiplies together monomials. Consider the
problem of verifying the following identity, which might be output by our program:

$$(x + 1)(x - 2)(x + 3)(x - 4)(x + 5)(x - 6) = x^6 - 7x^3 + 25.$$

There is an easy way to verify whether the identity is correct: multiply together the
terms on the left-hand side and see if the resulting polynomial matches the right-hand
side. In this example, when we multiply all the constant terms on the left, the result
does not match the constant term on the right, so the identity cannot be valid. More
generally, given two polynomials F(x) and G(x), we can verify the identity

$$F(x) = G(x)$$

by converting the two polynomials to their canonical forms ($\sum_{i=0}^{d} c_i x^i$); two
polynomials are equivalent if and only if all the coefficients in their canonical forms are
equal. From this point on let us assume that, as in our example, F(x) is given as a product
$F(x) = \prod_{i=1}^{d} (x - a_i)$ and G(x) is given in its canonical form. Transforming F(x) to
its canonical form by consecutively multiplying the ith monomial with the product of
the first $i - 1$ monomials requires $\Theta(d^2)$ multiplications of coefficients. We assume in
what follows that each multiplication can be performed in constant time, although if the
products of the coefficients grow large then it could conceivably require more than
constant time to add and multiply numbers together.
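The deterministic check described above can be sketched in a few lines of Python. This is only an illustration: the function name `expand_monomials` and the representation of a polynomial as a coefficient list (lowest degree first) are assumptions made for this example, not notation from the text.

```python
def expand_monomials(roots):
    """Return the coefficients c_0, ..., c_d of prod_i (x - a_i), lowest degree first.

    `roots` holds the values a_i (hypothetical input format for this sketch).
    """
    coeffs = [1]  # start from the constant polynomial 1
    for a in roots:
        # Multiply the current polynomial p(x) by (x - a).
        shifted = [0] + coeffs                   # coefficients of x * p(x)
        scaled = [-a * c for c in coeffs] + [0]  # coefficients of -a * p(x)
        coeffs = [s + t for s, t in zip(shifted, scaled)]
    return coeffs


# The identity from the text: (x+1)(x-2)(x+3)(x-4)(x+5)(x-6) = x^6 - 7x^3 + 25 ?
roots = [-1, 2, -3, 4, -5, 6]
canonical = expand_monomials(roots)
claimed = [25, 0, 0, -7, 0, 0, 1]

print(canonical[0], claimed[0])  # constant terms: -720 versus 25
print(canonical == claimed)      # False, so the identity is not valid
```

Each pass through the loop touches every coefficient computed so far, which is where the $\Theta(d^2)$ count of coefficient multiplications comes from.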

So far, we have not said anything particularly interesting. To check whether the
computer program has multiplied monomials together correctly, we have suggested
multiplying the monomials together again to check the result. Our approach for checking
the program is to write another program that does essentially the same thing we
expect the first program to do. This is certainly one way to double-check a program:
write a second program that does the same thing, and make sure they agree. There
are at least two problems with this approach, both stemming from the idea that there
should be a difference between checking a given answer and recomputing it. First, if
there is a bug in the program that multiplies monomials, the same bug may occur in the
checking program. (Suppose that the checking program was written by the same person
who wrote the original program!) Second, it stands to reason that we would like
to check the answer in less time than it takes to try to solve the original problem all
over again.

Let us instead utilize randomness to obtain a faster method to verify the identity. We
informally explain the algorithm and then set up the formal mathematical framework
for analyzing the algorithm.

Assume that the maximum degree, or the largest exponent of x, in F(x) and G(x) is
d. The algorithm chooses an integer r uniformly at random in the range $\{1, \ldots, 100d\}$,
where by “uniformly at random” we mean that all integers are equally likely to be
chosen. The algorithm then computes the values F(r) and G(r). If $F(r) \neq G(r)$ the
algorithm decides that the two polynomials are not equivalent, and if $F(r) = G(r)$ the
algorithm decides that the two polynomials are equivalent.
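The randomized check just described might look as follows in Python. This is a minimal sketch under stated assumptions: F is passed as the list of values $a_i$, G as a coefficient list with the lowest degree first, and the helper names are hypothetical. Evaluating G by Horner's rule is a choice made here for convenience; the text does not prescribe an evaluation method.

```python
import random


def eval_product(roots, r):
    """Evaluate F(r) = prod_i (r - a_i) with one multiplication per monomial."""
    value = 1
    for a in roots:
        value *= r - a
    return value


def eval_canonical(coeffs, r):
    """Evaluate G(r) = sum_i c_i r^i by Horner's rule; coeffs[i] is c_i."""
    value = 0
    for c in reversed(coeffs):
        value = value * r + c
    return value


def randomized_identity_test(roots, coeffs, rng=random):
    """Return True if the single random evaluation finds F(r) == G(r)."""
    d = max(1, len(roots), len(coeffs) - 1)  # bound on the degree
    r = rng.randint(1, 100 * d)              # uniform over {1, ..., 100d}
    return eval_product(roots, r) == eval_canonical(coeffs, r)


# The claimed identity from the text, tested once.
print(randomized_identity_test([-1, 2, -3, 4, -5, 6], [25, 0, 0, -7, 0, 0, 1]))
```

A result of `False` is always conclusive, since it exhibits a point where the two sides differ. A result of `True` only means that no difference was found at the sampled point.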

Suppose that in one computation step the algorithm can generate an integer chosen
uniformly at random in the range $\{1, \ldots, 100d\}$. Computing the values of F(r) and
G(r) can be done in O(d) time, which is faster than computing the canonical form of
F(x). The randomized algorithm, however, may give a wrong answer.

How can the algorithm give the wrong answer?

If $F(x) \equiv G(x)$, then the algorithm gives the correct answer, since it will find that
$F(r) = G(r)$ for any value of r.

If $F(x) \not\equiv G(x)$, then the algorithm gives the wrong answer only when $F(r) = G(r)$,
that is, when r is a root of the polynomial $F(x) - G(x)$. This polynomial has degree at
most d, and a nonzero polynomial of degree at most d has no more than d
roots. Thus, if $F(x) \not\equiv G(x)$, then there are no more than d values in the
range $\{1, \ldots, 100d\}$ for which $F(r) = G(r)$. Since there are 100d values in the range
$\{1, \ldots, 100d\}$, the chance that the algorithm chooses such a value and returns a wrong
answer is no more than 1/100.


1.2. Axioms of Probability

We turn now to a formal mathematical setting for analyzing the randomized algorithm. 
Any probabilistic statement must refer to the underlying probability space. 

Definition 1.1: A probability space has three components:

1. a sample space $\Omega$, which is the set of all possible outcomes of the random process
modeled by the probability space;

2. a family of sets $\mathcal{F}$ representing the allowable events, where each set in $\mathcal{F}$ is a subset
of the sample space $\Omega$; and

3. a probability function $\Pr\colon \mathcal{F} \to \mathbb{R}$ satisfying Definition 1.2.

An element of $\Omega$ is called a simple or elementary event.

In the randomized algorithm for verifying polynomial identities, the sample space
is the set of integers $\{1, \ldots, 100d\}$. Each choice of an integer r in this range is a simple
event.

Definition 1.2: A probability function is any function $\Pr\colon \mathcal{F} \to \mathbb{R}$ that satisfies the
following conditions:

1. for any event E, $0 \le \Pr(E) \le 1$;

2. $\Pr(\Omega) = 1$; and

3. for any finite or countably infinite sequence of pairwise mutually disjoint events
$E_1, E_2, E_3, \ldots$,

$$\Pr\Bigl(\bigcup_{i \ge 1} E_i\Bigr) = \sum_{i \ge 1} \Pr(E_i).$$

In most of this book we will use discrete probability spaces. In a discrete probability
space the sample space $\Omega$ is finite or countably infinite, and the family $\mathcal{F}$ of
allowable events consists of all subsets of $\Omega$. In a discrete probability space, the probability
function is uniquely defined by the probabilities of the simple events.

Again, in the randomized algorithm for verifying polynomial identities, each choice
of an integer r is a simple event. Since the algorithm chooses the integer uniformly at
random, all simple events have equal probability. The sample space has 100d simple
events, and the sum of the probabilities of all simple events must be 1. Therefore each
simple event has probability 1/100d.

Because events are sets, we use standard set theory notation to express combinations
of events. We write $E_1 \cap E_2$ for the occurrence of both $E_1$ and $E_2$ and write $E_1 \cup E_2$
for the occurrence of either $E_1$ or $E_2$ (or both). For example, suppose we roll two dice.
If $E_1$ is the event that the first die is a 1 and $E_2$ is the event that the second die is a 1,
then $E_1 \cap E_2$ denotes the event that both dice are 1 while $E_1 \cup E_2$ denotes the event that
at least one of the two dice lands on 1. Similarly, we write $E_1 - E_2$ for the occurrence
of an event that is in $E_1$ but not in $E_2$. With the same dice example, $E_1 - E_2$ consists
of the event where the first die is a 1 and the second die is not. We use the notation $\bar{E}$
as shorthand for $\Omega - E$; for example, if E is the event that we obtain an even number
when rolling a die, then $\bar{E}$ is the event that we obtain an odd number.
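The two-dice example can be made concrete with Python sets. The sketch below assumes two fair six-sided dice, so that all 36 outcomes are equally likely; the names `omega`, `pr`, `e1`, and `e2` are chosen for this illustration only.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs (first die, second die), all equally likely.
omega = set(product(range(1, 7), repeat=2))


def pr(event):
    """Probability of an event (a subset of omega) in the uniform space."""
    return Fraction(len(event), len(omega))


e1 = {w for w in omega if w[0] == 1}  # first die is a 1
e2 = {w for w in omega if w[1] == 1}  # second die is a 1

both = e1 & e2         # E1 intersect E2: both dice are 1
either = e1 | e2       # E1 union E2: at least one die is 1
only_first = e1 - e2   # E1 - E2: first die is 1, second is not
not_e1 = omega - e1    # complement of E1

print(pr(both), pr(either), pr(only_first), pr(not_e1))  # 1/36 11/36 5/36 5/6
```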

Definition 1.2 yields the following obvious lemma.

Lemma 1.1: For any two events $E_1$ and $E_2$,

$$\Pr(E_1 \cup E_2) = \Pr(E_1) + \Pr(E_2) - \Pr(E_1 \cap E_2).$$


Proof: From the definition,

$$\Pr(E_1) = \Pr(E_1 - (E_1 \cap E_2)) + \Pr(E_1 \cap E_2),$$

$$\Pr(E_2) = \Pr(E_2 - (E_1 \cap E_2)) + \Pr(E_1 \cap E_2),$$

$$\Pr(E_1 \cup E_2) = \Pr(E_1 - (E_1 \cap E_2)) + \Pr(E_2 - (E_1 \cap E_2)) + \Pr(E_1 \cap E_2).$$

The lemma easily follows. ■

A consequence of Definition 1.2 is known as the union bound. Although it is very
simple, it is tremendously useful.

Lemma 1.2: For any finite or countably infinite sequence of events $E_1, E_2, \ldots$,

$$\Pr\Bigl(\bigcup_{i \ge 1} E_i\Bigr) \le \sum_{i \ge 1} \Pr(E_i).$$

Notice that Lemma 1.2 differs from the third part of Definition 1.2 in that Definition 1.2
is an equality and requires the events to be pairwise mutually disjoint.
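A short worked check may help here. Take the two-dice events from the earlier example, with $E_1$ the event that the first die is a 1 and $E_2$ the event that the second die is a 1, and assume fair dice so that $\Pr(E_1 \cap E_2) = 1/36$. Lemma 1.1 gives the exact value, and the union bound gives a slightly larger estimate:

```latex
\begin{align*}
\Pr(E_1 \cup E_2) &= \Pr(E_1) + \Pr(E_2) - \Pr(E_1 \cap E_2)
                   = \tfrac{1}{6} + \tfrac{1}{6} - \tfrac{1}{36} = \tfrac{11}{36}, \\
\Pr(E_1) + \Pr(E_2) &= \tfrac{12}{36} \;\ge\; \tfrac{11}{36}.
\end{align*}
```

The gap of 1/36 is exactly the probability of the overlap, which the union bound counts twice.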

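Lemma 1.1 can be generalized to the following equality, often referred to as the
inclusion-exclusion principle.

Lemma 1.3: Let $E_1, \ldots, E_n$ be any n events. Then

$$\Pr\Bigl(\bigcup_{i=1}^{n} E_i\Bigr) = \sum_{i=1}^{n} \Pr(E_i) - \sum_{i<j} \Pr(E_i \cap E_j) + \sum_{i<j<k} \Pr(E_i \cap E_j \cap E_k) - \cdots + (-1)^{\ell+1} \sum_{i_1 < i_2 < \cdots < i_\ell} \Pr\Bigl(\bigcap_{r=1}^{\ell} E_{i_r}\Bigr) + \cdots.$$

The proof of the inclusion-exclusion principle is left as Exercise 1.7.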
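The inclusion-exclusion principle can be checked numerically on a small example. The following sketch assumes a uniform finite sample space; the function name `union_probability` and the choice of events (multiples of 2, 3, and 5 among the integers 1 to 30) are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction
from itertools import combinations


def union_probability(events, omega_size):
    """Pr(E_1 u ... u E_n) via inclusion-exclusion on a uniform finite space."""
    total = Fraction(0)
    for size in range(1, len(events) + 1):
        sign = 1 if size % 2 == 1 else -1  # alternate +, -, +, ...
        for group in combinations(events, size):
            common = set.intersection(*group)
            total += sign * Fraction(len(common), omega_size)
    return total


omega = range(1, 31)
events = [{w for w in omega if w % p == 0} for p in (2, 3, 5)]

direct = Fraction(len(set.union(*events)), 30)  # count the union directly
print(union_probability(events, 30), direct)    # 11/15 11/15
```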

We showed before that the only case in which the algorithm may fail to give the
correct answer is when the two input polynomials F(x) and G(x) are not equivalent; the
algorithm then gives an incorrect answer if the random number it chooses is a root of
the polynomial $F(x) - G(x)$. Let E represent the event that the algorithm failed to
give the correct answer. The elements of the set corresponding to E are the roots of the
polynomial $F(x) - G(x)$ that are in the set of integers $\{1, \ldots, 100d\}$. Since the
polynomial has no more than d roots it follows that the event E includes no more than d
simple events, and therefore

$$\Pr(\text{algorithm fails}) = \Pr(E) \le \frac{d}{100d} = \frac{1}{100}.$$

It may seem unusual to have an algorithm that can return the wrong answer. It may
help to think of the correctness of an algorithm as a goal that we seek to optimize in
conjunction with other goals. In designing an algorithm, we generally seek to
minimize the number of computational steps and the memory required. Sometimes there is
a trade-off; there may be a faster algorithm that uses more memory or a slower
algorithm that uses less memory. The randomized algorithm we have presented gives a
trade-off between correctness and speed. Allowing algorithms that may give an
incorrect answer (but in a systematic way) expands the trade-off space available in designing
algorithms. Rest assured, however, that not all randomized algorithms give incorrect
answers, as we shall see.

For the algorithm just described, the algorithm gives the correct answer at least 99% of
the time even when the polynomials are not equivalent. Can we improve this probability?
One way is to choose the random number r from a larger range of integers. If our
sample space is the set of integers $\{1, \ldots, 1000d\}$, then the probability of a wrong answer
is at most 1/1000. At some point, however, the range of values we can use is limited
by the precision available on the machine on which we run the algorithm.

Another approach is to repeat the algorithm multiple times, using different random
values to test the identity. The property we use here is that the algorithm has a one-sided
error. The algorithm may be wrong only when it outputs that the two polynomials are
equivalent. If any run yields a number r such that $F(r) \neq G(r)$, then the
polynomials are not equivalent. Thus, if we repeat the algorithm a number of times and find
$F(r) \neq G(r)$ in at least one round of the algorithm, we know that F(x) and G(x) are
not equivalent. The algorithm outputs that the two polynomials are equivalent only if
there is equality for all runs.

In repeating the algorithm we repeatedly choose a random number in the range
$\{1, \ldots, 100d\}$. Repeatedly choosing random numbers according to a given distribution
is generally referred to as sampling. In this case, we can repeatedly choose random
numbers in the range $\{1, \ldots, 100d\}$ in two ways: we can sample either with replacement
or without replacement. Sampling with replacement means that we do not remember
which numbers we have already tested; each time we run the algorithm, we choose
a number uniformly at random from the range $\{1, \ldots, 100d\}$ regardless of previous
choices, so there is some chance we will choose an r that we have chosen on a
previous run. Sampling without replacement means that, once we have chosen a number r,
we do not allow the number to be chosen on subsequent runs; the number chosen at a
given iteration is uniform over all previously unselected numbers.
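The two sampling modes can be contrasted with Python's standard `random` module. In this sketch the function names are hypothetical; `randint` draws each value independently, while `sample` returns distinct values and so models sampling without replacement.

```python
import random


def sample_with_replacement(d, k, rng=random):
    """Draw k values from {1, ..., 100d}; repeats are possible."""
    return [rng.randint(1, 100 * d) for _ in range(k)]


def sample_without_replacement(d, k, rng=random):
    """Draw k distinct values from {1, ..., 100d}; requires k <= 100d."""
    return rng.sample(range(1, 100 * d + 1), k)


print(sample_with_replacement(1, 10))
print(sample_without_replacement(1, 10))
```

The second function has to track what has already been chosen (here the library does it internally), which is the bookkeeping that sampling with replacement avoids.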

Let us first consider the case where sampling is done with replacement. Assume
that we repeat the algorithm k times, and that the input polynomials are not
equivalent. What is the probability that in all k iterations our random sampling from the set
$\{1, \ldots, 100d\}$ yields roots of the polynomial $F(x) - G(x)$, resulting in a wrong output
by the algorithm? If k = 1, we know that this probability is at most d/100d = 1/100.
If k = 2, it seems that the probability that the first iteration finds a root is at most 1/100
and the probability that the second iteration finds a root is at most 1/100, so the
probability that both iterations find a root is at most $(1/100)^2$. Generalizing, for any k, the
probability of choosing roots for k iterations would be at most $(1/100)^k$.

To formalize this, we introduce the notion of independence. 


Definition 1.3: Two events E and F are independent if and only if

$$\Pr(E \cap F) = \Pr(E) \cdot \Pr(F).$$

More generally, events $E_1, E_2, \ldots, E_k$ are mutually independent if and only if, for any
subset $I \subseteq [1, k]$,

$$\Pr\Bigl(\bigcap_{i \in I} E_i\Bigr) = \prod_{i \in I} \Pr(E_i).$$
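Definition 1.3 can be tested directly on a small sample space. The sketch below reuses the two-dice setting and assumes fair dice; the predicate names and the helper `independent` are hypothetical and serve only to illustrate the product rule.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes


def pr(pred):
    """Probability that a predicate on outcomes holds, in the uniform space."""
    return Fraction(sum(1 for w in omega if pred(w)), len(omega))


def independent(a, b):
    """Check Pr(A and B) == Pr(A) * Pr(B) exactly."""
    return pr(lambda w: a(w) and b(w)) == pr(a) * pr(b)


first_is_1 = lambda w: w[0] == 1
second_is_1 = lambda w: w[1] == 1
sum_is_2 = lambda w: w[0] + w[1] == 2

print(independent(first_is_1, second_is_1))  # True
print(independent(first_is_1, sum_is_2))     # False
```

The second pair fails the test because a sum of 2 forces the first die to be a 1, so knowing one event changes the probability of the other.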


If our algorithm samples with replacement then in each iteration the algorithm chooses
a random number uniformly at random from the set $\{1, \ldots, 100d\}$, and thus the choice in
one iteration is independent of the choices in previous iterations. For the case where the
polynomials are not equivalent, let $E_i$ be the event that, on the ith run of the algorithm,
we choose a root $r_i$ such that $F(r_i) - G(r_i) = 0$. The probability that the algorithm
returns the wrong answer is given by

$$\Pr(E_1 \cap E_2 \cap \cdots \cap E_k).$$

Since $\Pr(E_i)$ is at most d/100d and since the events $E_1, E_2, \ldots, E_k$ are independent,
the probability that the algorithm gives the wrong answer after k iterations is

$$\Pr(E_1 \cap E_2 \cap \cdots \cap E_k) = \prod_{i=1}^{k} \Pr(E_i) \le \prod_{i=1}^{k} \frac{d}{100d} = \left(\frac{1}{100}\right)^{k}.$$

The probability of making an error is therefore at most exponentially small in the
number of trials.
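Repeating the test with replacement is straightforward to sketch in Python. As before, the input format (a list of the values $a_i$ for F, a coefficient list for G) and all function names are assumptions made for this illustration, and Horner's rule is simply one convenient way to evaluate G.

```python
import random


def eval_product(roots, r):
    """Evaluate F(r) = prod_i (r - a_i)."""
    value = 1
    for a in roots:
        value *= r - a
    return value


def eval_canonical(coeffs, r):
    """Evaluate G(r) = sum_i c_i r^i by Horner's rule; coeffs[i] is c_i."""
    value = 0
    for c in reversed(coeffs):
        value = value * r + c
    return value


def repeated_identity_test(roots, coeffs, k, rng=random):
    """Run k independent trials, sampling r with replacement from {1, ..., 100d}."""
    d = max(1, len(roots), len(coeffs) - 1)
    for _ in range(k):
        r = rng.randint(1, 100 * d)
        if eval_product(roots, r) != eval_canonical(coeffs, r):
            return False  # a witness was found: certainly not equivalent
    # Every trial agreed. If the polynomials differ, this happens with
    # probability at most (1/100)**k.
    return True


print(repeated_identity_test([1, 2], [2, -3, 1], k=5))  # True: (x-1)(x-2) = x^2 - 3x + 2
```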

Now let us consider the case where sampling is done without replacement. In this 
case the probability of choosing a given number is conditioned on the events of the 
previous iterations. 


Definition 1.4: The conditional probability that event E occurs given that event F
occurs is

$$\Pr(E \mid F) = \frac{\Pr(E \cap F)}{\Pr(F)}.$$

The conditional probability is well-defined only if $\Pr(F) > 0$.


Intuitively, we are looking for the probability of $E \cap F$ within the set of events defined
by F. Because F defines our restricted sample space, we normalize the probabilities
by dividing by $\Pr(F)$, so that the sum of the probabilities of all events is 1. When
$\Pr(F) > 0$, the definition can also be written in the useful form

$$\Pr(E \mid F)\Pr(F) = \Pr(E \cap F).$$

Notice that, when E and F are independent and $\Pr(F) \neq 0$, we have

$$\Pr(E \mid F) = \frac{\Pr(E \cap F)}{\Pr(F)} = \frac{\Pr(E)\Pr(F)}{\Pr(F)} = \Pr(E).$$


This is a property that conditional probability should have; intuitively, if two events are
independent, then information about one event should not affect the probability of the
second event.
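Conditional probability can be computed the same way on a finite uniform space. The sketch below again assumes two fair dice; `cond_pr` and the predicate names are hypothetical helpers introduced for this example.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes


def pr(pred):
    """Probability that a predicate on outcomes holds, in the uniform space."""
    return Fraction(sum(1 for w in omega if pred(w)), len(omega))


def cond_pr(e, f):
    """Pr(E | F) = Pr(E and F) / Pr(F); only defined when Pr(F) > 0."""
    pr_f = pr(f)
    if pr_f == 0:
        raise ValueError("Pr(F) must be positive")
    return pr(lambda w: e(w) and f(w)) / pr_f


first_is_1 = lambda w: w[0] == 1
second_is_1 = lambda w: w[1] == 1
both_are_1 = lambda w: w == (1, 1)
at_least_one_1 = lambda w: 1 in w

print(cond_pr(second_is_1, first_is_1))    # 1/6, the same as Pr(second die is 1)
print(cond_pr(both_are_1, at_least_one_1)) # 1/11
```

The first result matches the unconditional probability because the two dice are independent. The second shows the normalization at work: the 11 outcomes with at least one 1 become the restricted sample space.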

Again assume that we repeat the algorithm k times and that the input polynomials
are not equivalent. What is the probability that in all the k iterations our random
sampling from the set $\{1, \ldots, 100d\}$ yields roots of the polynomial $F(x) - G(x)$, resulting
in a wrong output by the algorithm?

As in the analysis with replacement, we let $E_i$ be the event that the random
number $r_i$ chosen in the ith iteration of the algorithm is a root of $F(x) - G(x)$; again, the
probability that the algorithm returns the wrong answer is given by

$$\Pr(E_1 \cap E_2 \cap \cdots \cap E_k).$$

Applying the definition of conditional probability, we obtain

$$\Pr(E_1 \cap E_2 \cap \cdots \cap E_k) = \Pr(E_k \mid E_1 \cap E_2 \cap \cdots \cap E_{k-1}) \cdot \Pr(E_1 \cap E_2 \cap \cdots \cap E_{k-1}),$$

and repeating this argument gives

$$\Pr(E_1 \cap E_2 \cap \cdots \cap E_k) = \Pr(E_1) \cdot \Pr(E_2 \mid E_1) \cdot \Pr(E_3 \mid E_1 \cap E_2) \cdots \Pr(E_k \mid E_1 \cap E_2 \cap \cdots \cap E_{k-1}).$$


Can we bound $\Pr(E_j \mid E_1 \cap E_2 \cap \cdots \cap E_{j-1})$? Recall that there are at most d values
r for which $F(r) - G(r) = 0$; if trials 1 through $j - 1 < d$ have found $j - 1$ of them,
then when sampling without replacement there are only $d - (j - 1)$ values out of the
$100d - (j - 1)$ remaining choices for which $F(r) - G(r) = 0$. Hence

$$\Pr(E_j \mid E_1 \cap E_2 \cap \cdots \cap E_{j-1}) \le \frac{d - (j - 1)}{100d - (j - 1)},$$

and the probability that the algorithm gives the wrong answer after $k \le d$ iterations is
bounded by

$$\Pr(E_1 \cap E_2 \cap \cdots \cap E_k) \le \prod_{j=1}^{k} \frac{d - (j - 1)}{100d - (j - 1)}.$$



Because $(d - (j - 1))/(100d - (j - 1)) < d/100d$ when $j > 1$, our bounds on the
probability of making an error are actually slightly better without replacement. You may
also notice that, if we take $d + 1$ samples without replacement and the two
polynomials are not equivalent, then we are guaranteed to find an r such that $F(r) - G(r) \neq 0$.
Thus, in $d + 1$ iterations we are guaranteed to output the correct answer. However,
computing the value of the polynomial at $d + 1$ points takes $\Theta(d^2)$ time using the standard
approach, which is no faster than finding the canonical form deterministically.
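To see how close the two bounds are, they can be evaluated exactly with rational arithmetic. This sketch only computes the upper bounds derived above; the function names and the sample value d = 6 are illustrative assumptions.

```python
from fractions import Fraction


def bound_with_replacement(d, k):
    """Upper bound (d / 100d)^k = (1/100)^k on the error after k trials."""
    return Fraction(d, 100 * d) ** k


def bound_without_replacement(d, k):
    """Upper bound prod_{j=1..k} (d - (j-1)) / (100d - (j-1)), for k <= d + 1."""
    bound = Fraction(1)
    for j in range(1, k + 1):
        bound *= Fraction(d - (j - 1), 100 * d - (j - 1))
    return bound


d = 6
for k in (1, 2, 3, d + 1):
    print(k, float(bound_with_replacement(d, k)), float(bound_without_replacement(d, k)))
```

For k = d + 1 the last factor of the product is zero, which reflects the guarantee that d + 1 distinct samples must include a point that is not a root.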

Since sampling without replacement appears to give better bounds on the
probability of error, why would we ever want to consider sampling with replacement? In some
cases, sampling with replacement is significantly easier to analyze, so it may be worth
considering for theoretical reasons. In practice, sampling with replacement is often
simpler to code and the effect on the probability of making an error is almost
negligible, making it a desirable alternative.


1.3. Application: Verifying Matrix Multiplication 


We now consider another example where randomness can be used to verify an
equality more quickly than the known deterministic algorithms. Suppose we are given three
$n \times n$ matrices A, B, and C. For convenience, assume we are working over the integers
modulo 2. We want to verify whether

$$AB = C.$$

One way to accomplish this is to multiply A and B and compare the result to C. The
simple matrix multiplication algorithm takes $\Theta(n^3)$ operations. There exist more
sophisticated algorithms that are known to take roughly $\Theta(n^{2.37})$ operations.

Once again, we use a randomized algorithm that allows for faster verification - at the
expense of possibly returning a wrong answer with small probability. The algorithm is
similar in spirit to our randomized algorithm for checking polynomial identities. The
algorithm chooses a random vector $\bar{r} = (r_1, r_2, \ldots, r_n) \in \{0, 1\}^n$. It then computes $AB\bar{r}$
by first computing $B\bar{r}$ and then $A(B\bar{r})$, and it also computes $C\bar{r}$. If $A(B\bar{r}) \neq C\bar{r}$, then
$AB \neq C$. Otherwise, it returns that $AB = C$.

The algorithm requires three matrix-vector multiplications, which can be done in
time $\Theta(n^2)$ in the obvious way. The probability that the algorithm returns that $AB = C$
when they are actually not equal is bounded by the following theorem.
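A single trial of this verification can be sketched in Python as follows. The technique is commonly known as Freivalds' algorithm, a name the text does not use. Matrices are assumed to be lists of rows with entries in {0, 1}, and all function names are hypothetical.

```python
import random


def mat_vec_mod2(m, v):
    """Multiply matrix m by vector v over the integers modulo 2, in Theta(n^2) time."""
    return [sum(a * b for a, b in zip(row, v)) % 2 for row in m]


def agrees_on(a, b, c, r):
    """Check whether A(Br) == Cr for one particular vector r."""
    return mat_vec_mod2(a, mat_vec_mod2(b, r)) == mat_vec_mod2(c, r)


def verify_product_once(a, b, c, rng=random):
    """One randomized trial: pick r uniformly from {0,1}^n and compare A(Br) with Cr."""
    r = [rng.randint(0, 1) for _ in range(len(a))]
    return agrees_on(a, b, c, r)


a = [[1, 1], [0, 1]]
b = [[1, 0], [1, 1]]
c = [[0, 1], [1, 1]]  # the true product AB modulo 2
print(verify_product_once(a, b, c))  # True for every choice of r
```

Computing $B\bar{r}$ first keeps every step a matrix-vector product, so the matrix product AB is never formed.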


Theorem 1.4: If $AB \neq C$ and if $\bar{r}$ is chosen uniformly at random from $\{0, 1\}^n$, then

$$\Pr(AB\bar{r} = C\bar{r}) \le \frac{1}{2}.$$

Proof: Before beginning, we point out that the sample space for the vector $\bar{r}$ is the set
$\{0, 1\}^n$ and that the event under consideration is $AB\bar{r} = C\bar{r}$. We also make note of the
following simple but useful lemma.


Lemma 1.5: Choosing $\bar{r} = (r_1, r_2, \ldots, r_n) \in \{0, 1\}^n$ uniformly at random is equivalent
to choosing each $r_i$ independently and uniformly from $\{0, 1\}$.


Proof: If each $r_i$ is chosen independently and uniformly at random, then each of the
$2^n$ possible vectors $\bar{r}$ is chosen with probability $2^{-n}$, giving the lemma. □

Let $D = AB - C \neq 0$. Then $AB\bar{r} = C\bar{r}$ implies that $D\bar{r} = 0$. Since $D \neq 0$ it must
have some nonzero entry; without loss of generality, let that entry be $d_{11}$.

For $D\bar{r} = 0$, it must be the case that

$$\sum_{j=1}^{n} d_{1j} r_j = 0$$

or, equivalently,

$$r_1 = -\frac{\sum_{j=2}^{n} d_{1j} r_j}{d_{11}}. \qquad (1.1)$$

Now we introduce a helpful idea. Instead of reasoning about the vector $\bar{r}$, suppose
that we choose the $r_k$ independently and uniformly at random from $\{0, 1\}$ in order, from
$r_n$ down to $r_1$. Lemma 1.5 says that choosing the $r_k$ in this way is equivalent to
choosing a vector $\bar{r}$ uniformly at random. Now consider the situation just before $r_1$ is chosen.
At this point, the right-hand side of Eqn. (1.1) is determined, and there is at most one
choice for $r_1$ that will make that equality hold. Since there are two choices for $r_1$, the
equality holds with probability at most 1/2, and hence the probability that $AB\bar{r} = C\bar{r}$
is at most 1/2. By considering all variables besides $r_1$ as having been set, we have
reduced the sample space to the set of two values $\{0, 1\}$ for $r_1$ and have changed the event
being considered to whether Eqn. (1.1) holds.
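The counting step in this argument can be checked exhaustively for a small case. The sketch below fixes a hypothetical first row of D with $d_{11} = 1$ and, for every setting of $r_2, \ldots, r_n$, lists the values of $r_1$ that satisfy the first-row equation modulo 2.

```python
from itertools import product


def choices_for_r1(first_row, rest):
    """Values of r_1 in {0, 1} with sum_j d_1j * r_j = 0 (mod 2), given r_2..r_n."""
    tail = sum(d * r for d, r in zip(first_row[1:], rest))
    return [r1 for r1 in (0, 1) if (first_row[0] * r1 + tail) % 2 == 0]


first_row = [1, 0, 1, 1]  # hypothetical first row of D; d_11 = 1 is nonzero

for rest in product((0, 1), repeat=len(first_row) - 1):
    # Once r_2..r_n are fixed, at most one of the two values of r_1 works.
    print(rest, choices_for_r1(first_row, rest))
```

Since at most one of the two equally likely values of $r_1$ works for each setting of the other entries, at most half of all vectors can satisfy $D\bar{r} = 0$.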

This idea is called the principle of deferred decisions. When there are several 
random variables, such as the $r_i$ of the vector r, it often helps to think of some of them 
as being set at one point in the algorithm with the rest of them being left random - or 
deferred - until some further point in the analysis. Formally, this corresponds to 
conditioning on the revealed values; when some of the random variables are revealed, we 
must condition on the revealed values for the rest of the analysis. We will see further 
examples of the principle of deferred decisions later in the book. 
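
The bound obtained by deferring the choice of $r_1$ can be checked by brute force on a small case. The following is a minimal sketch, assuming arithmetic modulo 2 and a small nonzero matrix D chosen purely for illustration; the function name `fraction_with_Dr_zero` is hypothetical and not part of the text.

```python
from itertools import product

def fraction_with_Dr_zero(D):
    """Fraction of vectors r in {0,1}^n with D r = 0 (arithmetic modulo 2)."""
    n = len(D)
    zero_count = 0
    for r in product((0, 1), repeat=n):
        # Check every row of the product D r.
        if all(sum(D[i][j] * r[j] for j in range(n)) % 2 == 0 for i in range(n)):
            zero_count += 1
    return zero_count / 2 ** n

# Illustrative nonzero matrix (an assumption, not taken from the text).
D = [[1, 1, 0],
     [0, 0, 0],
     [0, 1, 1]]
print(fraction_with_Dr_zero(D))  # 0.25, which is at most 1/2
```

For this D, the product Dr vanishes only when all three entries of r are equal, so two of the eight vectors satisfy Dr = 0.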

To formalize this argument, we first introduce a simple fact, known as the law of 
total probability. 

Theorem 1.6 [Law of Total Probability]: Let $E_1, E_2, \ldots, E_n$ be mutually disjoint 
events in the sample space $\Omega$, and let $\bigcup_{i=1}^{n} E_i = \Omega$. Then 

$$\Pr(B) = \sum_{i=1}^{n} \Pr(B \cap E_i) = \sum_{i=1}^{n} \Pr(B \mid E_i)\Pr(E_i).$$

Proof: Since the events $B \cap E_i$ ($i = 1, \ldots, n$) are disjoint and their union is B 
(because the $E_i$ cover the entire sample space $\Omega$), it follows that 

$$\Pr(B) = \sum_{i=1}^{n} \Pr(B \cap E_i).$$

Further, 

$$\sum_{i=1}^{n} \Pr(B \cap E_i) = \sum_{i=1}^{n} \Pr(B \mid E_i)\Pr(E_i)$$

by the definition of conditional probability. ■ 

Now, using this law and summing over all collections of values 
$(x_2, x_3, x_4, \ldots, x_n) \in \{0,1\}^{n-1}$ yields 

$$\begin{aligned}
&\Pr(ABr = Cr) \\
&\quad= \sum_{(x_2,\ldots,x_n) \in \{0,1\}^{n-1}} \Pr\bigl((ABr = Cr) \cap ((r_2,\ldots,r_n) = (x_2,\ldots,x_n))\bigr) \\
&\quad\le \sum_{(x_2,\ldots,x_n) \in \{0,1\}^{n-1}} \Pr\Bigl(\Bigl(r_1 = -\frac{\sum_{j=2}^{n} d_{1j} r_j}{d_{11}}\Bigr) \cap ((r_2,\ldots,r_n) = (x_2,\ldots,x_n))\Bigr) \\
&\quad= \sum_{(x_2,\ldots,x_n) \in \{0,1\}^{n-1}} \Pr\Bigl(r_1 = -\frac{\sum_{j=2}^{n} d_{1j} r_j}{d_{11}}\Bigr) \cdot \Pr((r_2,\ldots,r_n) = (x_2,\ldots,x_n)) \\
&\quad\le \sum_{(x_2,\ldots,x_n) \in \{0,1\}^{n-1}} \frac{1}{2} \Pr((r_2,\ldots,r_n) = (x_2,\ldots,x_n)) \\
&\quad= \frac{1}{2}.
\end{aligned}$$

Here we have used the independence of $r_1$ and $(r_2, \ldots, r_n)$ in the fourth line. ■ 


To improve on the error probability of Theorem 1.4, we can again use the fact that the 
algorithm has a one-sided error and run the algorithm multiple times. If we ever find 
an r such that $ABr \neq Cr$, then the algorithm will correctly return that $AB \neq C$. If we 
always find ABr = Cr, then the algorithm returns that AB = C and there is some 
probability of a mistake. Choosing r with replacement from $\{0,1\}^n$ for each trial, we 
obtain that, after k trials, the probability of error is at most $2^{-k}$. Repeated trials increase 
the running time to $\Theta(kn^2)$. 

Suppose we attempt this verification 100 times. The running time of the randomized 
checking algorithm is still $\Theta(n^2)$, which is faster than the known deterministic 
algorithms for matrix multiplication for sufficiently large n. The probability that an 
incorrect identity passes the verification test 100 times is at most $2^{-100}$, an astronomically 
small number. In practice, the computer is much more likely to crash during the execution 
of the algorithm than to return a wrong answer. 
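
The repeated verification described above can be sketched in a few lines. This is a minimal illustration, assuming matrices are given as lists of rows and that arithmetic is over the integers modulo 2; the names `mat_vec_mod2` and `verify_product` are hypothetical helpers, not names used in the text.

```python
import random

def mat_vec_mod2(M, v):
    """Multiply matrix M by vector v, reducing each entry modulo 2."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) % 2 for row in M]

def verify_product(A, B, C, trials, rng=None):
    """Return False if some trial finds ABr != Cr, which proves AB != C.

    Return True if every trial passes. In that case AB = C unless all of
    the independent trials were unlucky, which happens with probability
    at most 2**-trials.
    """
    rng = rng or random.Random()
    n = len(A)
    for _ in range(trials):
        # Fresh r, chosen uniformly from {0,1}^n, for each trial.
        r = [rng.randrange(2) for _ in range(n)]
        # A(Br) and Cr need only matrix-vector products: Theta(n^2) work.
        if mat_vec_mod2(A, mat_vec_mod2(B, r)) != mat_vec_mod2(C, r):
            return False
    return True
```

Computing A(Br) rather than (AB)r is the design choice that keeps each trial at $\Theta(n^2)$ operations; forming AB first would defeat the purpose of the test.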

An interesting related problem is to evaluate the gradual change in our confidence in 
the correctness of the matrix multiplication as we repeat the randomized test. Toward 
that end we introduce Bayes’ law. 


Theorem 1.7 [Bayes’ Law]: Assume that $E_1, E_2, \ldots, E_n$ are mutually disjoint sets 
such that $\bigcup_{i=1}^{n} E_i = E$. Then 

$$\Pr(E_j \mid B) = \frac{\Pr(E_j \cap B)}{\Pr(B)} = \frac{\Pr(B \mid E_j)\Pr(E_j)}{\sum_{i=1}^{n} \Pr(B \mid E_i)\Pr(E_i)}.$$

As a simple application of Bayes’ law, consider the following problem. We are given 
three coins and are told that two of the coins are fair and the third coin is biased, 
landing heads with probability 2/3. We are not told which of the three coins is biased. We 
permute the coins randomly, and then flip each of the coins. The first and second coins 
come up heads, and the third comes up tails. What is the probability that the first coin 
is the biased one? 

The coins are in a random order and so, before our observing the outcomes of the 
coin flips, each of the three coins is equally likely to be the biased one. Let $E_i$ be the 
event that the ith coin flipped is the biased one, and let B be the event that the three 
coin flips came up heads, heads, and tails. 

Before we flip the coins we have $\Pr(E_i) = 1/3$ for all i. We can also compute the 
probability of the event B conditioned on $E_i$: 

$$\Pr(B \mid E_1) = \Pr(B \mid E_2) = \frac{2}{3} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{6},$$

and 

$$\Pr(B \mid E_3) = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{3} = \frac{1}{12}.$$

Applying Bayes’ law, we have 

$$\Pr(E_1 \mid B) = \frac{\Pr(B \mid E_1)\Pr(E_1)}{\sum_{i=1}^{3} \Pr(B \mid E_i)\Pr(E_i)} = \frac{2}{5}.$$


Thus, the outcome of the three coin flips increases the likelihood that the first coin is 
the biased one from 1/3 to 2/5. 
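
The three-coin calculation can be reproduced with exact rational arithmetic. The sketch below is only an illustration of Bayes’ law on this example; the function name `posterior` and the variable names are assumptions introduced here.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Bayes' law: Pr(E_j | B) for disjoint events E_j with the given
    priors Pr(E_j) and likelihoods Pr(B | E_j)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # Pr(B), by the law of total probability
    return [j / total for j in joint]

half = Fraction(1, 2)
priors = [Fraction(1, 3)] * 3
likelihoods = [
    Fraction(2, 3) * half * half,  # coin 1 biased: heads, heads, tails
    half * Fraction(2, 3) * half,  # coin 2 biased
    half * half * Fraction(1, 3),  # coin 3 biased: tails with probability 1/3
]
print(posterior(priors, likelihoods))  # posteriors 2/5, 2/5, 1/5
```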

Returning now to our randomized matrix multiplication test, we want to evaluate 
the increase in confidence in the matrix identity obtained through repeated tests. In 
the Bayesian approach one starts with a prior model, giving some initial value to the 
model parameters. This model is then modified, by incorporating new observations, to 
obtain a posterior model that captures the new information. 

In the matrix multiplication case, if we have no information about the process that 
generated the identity then a reasonable prior assumption is that the identity is correct 
with probability 1/2. If we run the randomized test once and it returns that the matrix 
identity is correct, how does this change our confidence in the identity? 

Let E be the event that the identity is correct, and let B be the event that the test 
returns that the identity is correct. We start with $\Pr(E) = \Pr(\bar{E}) = 1/2$, and since the 
test has a one-sided error bounded by 1/2, we have $\Pr(B \mid E) = 1$ and 
$\Pr(B \mid \bar{E}) \le 1/2$. Applying Bayes’ law yields 

$$\Pr(E \mid B) = \frac{\Pr(B \mid E)\Pr(E)}{\Pr(B \mid E)\Pr(E) + \Pr(B \mid \bar{E})\Pr(\bar{E})} \ge \frac{1/2}{1/2 + 1/2 \cdot 1/2} = \frac{2}{3}.$$


Assume now that we run the randomized test again and it again returns that the 
identity is correct. After the first test, I may naturally have revised my prior model, so that 
I believe $\Pr(E) \ge 2/3$ and $\Pr(\bar{E}) \le 1/3$. Now let B be the event that the new test 
returns that the identity is correct; since the tests are independent, as before we have 
$\Pr(B \mid E) = 1$ and $\Pr(B \mid \bar{E}) \le 1/2$. Applying Bayes’ law then yields 

$$\Pr(E \mid B) \ge \frac{2/3}{2/3 + 1/3 \cdot 1/2} = \frac{4}{5}.$$

In general: If our prior model (before running the test) is that $\Pr(E) \ge 2^i/(2^i + 1)$ 
and if the test returns that the identity is correct (event B), then 

$$\Pr(E \mid B) \ge \frac{\dfrac{2^i}{2^i + 1}}{\dfrac{2^i}{2^i + 1} + \dfrac{1}{2} \cdot \dfrac{1}{2^i + 1}} = \frac{2^{i+1}}{2^{i+1} + 1} = 1 - \frac{1}{2^{i+1} + 1}.$$

Thus, if all 100 calls to the matrix identity test return that the identity is correct, our 
confidence in the correctness of this identity is at least $1 - 1/(2^{100} + 1)$. 
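
The repeated Bayesian update can be traced numerically. The following sketch assumes the worst case allowed by the analysis, namely that an incorrect identity passes each test with probability exactly 1/2, and starts from the prior 1/2; `confidence_after` is a hypothetical name used only for this illustration.

```python
from fractions import Fraction

def confidence_after(k, prior=Fraction(1, 2)):
    """Lower bound on Pr(identity correct) after k passed tests.

    Assumes Pr(B | E) = 1 and the worst case Pr(B | not E) = 1/2.
    """
    p = prior
    for _ in range(k):
        # One application of Bayes' law with the current prior p.
        p = p / (p + (1 - p) * Fraction(1, 2))
    return p

print(confidence_after(1))  # 2/3
print(confidence_after(2))  # 4/5
```

After 100 passed tests the same loop gives $2^{100}/(2^{100} + 1)$, in agreement with the bound stated above.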


1.4. Application: A Randomized Min-Cut Algorithm 


A cut-set in a graph is a set of edges whose removal breaks the graph into two or 
more connected components. Given a graph G = (V, E) with n vertices, the minimum 
cut - or min-cut - problem is to find a minimum cardinality cut-set in G. Minimum 
cut problems arise in many contexts, including the study of network reliability. In the 
case where nodes correspond to machines in the network and edges correspond to 
connections between machines, the min-cut is the smallest number of edges that can fail 
before some pair of machines cannot communicate. Minimum cuts also arise in 
clustering problems. For example, if nodes represent Web pages (or any documents in a 
hypertext-based system) and two nodes have an edge between them if the corresponding 
nodes have a hyperlink between them, then small cuts divide the graph into clusters 
of documents with few links between clusters. Documents in different clusters are 
likely to be unrelated. 

We shall proceed by making use of the definitions and techniques presented so far in 
order to analyze a simple randomized algorithm for the min-cut problem. The main 
operation in the algorithm is edge contraction. In contracting an edge {u, v} we merge the 
two vertices u and v into one vertex, eliminate all edges connecting u and v, and retain 
all other edges in the graph. The new graph may have parallel edges but no self-loops. 
Examples appear in Figure 1.1, where in each step the dark edge is being contracted. 

The algorithm consists of $n - 2$ iterations. In each iteration, the algorithm picks an 
edge from the existing edges in the graph and contracts that edge. There are many 
possible ways one could choose the edge at each step. Our randomized algorithm chooses 
the edge uniformly at random from the remaining edges. 

Each iteration reduces the number of vertices in the graph by one. After $n - 2$ 
iterations, the graph consists of two vertices. The algorithm outputs the set of edges 
connecting the two remaining vertices. 
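
The contraction procedure just described can be sketched as follows. This is a minimal, unoptimized illustration under stated assumptions: vertices are labeled 0 through n - 1, the graph is connected, and it is given as a list of edges (parallel edges appear as repeated pairs). The names `contract_once` and `best_of_runs` are hypothetical.

```python
import random

def contract_once(n, edges, rng=None):
    """One run of the contraction algorithm on a connected multigraph.

    Returns the original edges that join the two super-vertices left
    after n - 2 contractions; these edges form a cut-set.
    """
    rng = rng or random.Random()
    label = list(range(n))   # label[v] = super-vertex currently containing v
    remaining = list(edges)  # edges between distinct super-vertices
    for _ in range(n - 2):
        # Uniform choice among remaining edges (parallel edges count separately).
        u, v = rng.choice(remaining)
        keep, drop = label[u], label[v]
        # Merge the two super-vertices.
        label = [keep if x == drop else x for x in label]
        # Discard edges that became self-loops; retain all others.
        remaining = [(a, b) for a, b in remaining if label[a] != label[b]]
    return remaining

def best_of_runs(n, edges, runs, rng=None):
    """Repeat independent runs and keep the smallest cut-set found."""
    rng = rng or random.Random()
    return min((contract_once(n, edges, rng) for _ in range(runs)), key=len)
```

A single run may return a cut-set that is not minimum, so in use one would call `best_of_runs` with enough repetitions to make failure unlikely.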

It is easy to verify that any cut-set of a graph in an intermediate iteration of the 
algorithm is also a cut-set of the original graph. On the other hand, not every cut-set of the 
original graph is a cut-set of a graph in an intermediate iteration, since some edges of 
the cut-set may have been contracted in previous iterations. As a result, the output of 
the algorithm is always a cut-set of the original graph but not necessarily the minimum 
cardinality cut-set (see Figure 1.1). 

[Figure 1.1: An example of two executions of min-cut in a graph with minimum cut-set of size 2. 
(a) A successful run of min-cut. (b) An unsuccessful run of min-cut.] 

We now establish a lower bound on the probability that the algorithm returns a 
correct output. 

Theorem 1.8: The algorithm outputs a min-cut set with probability at least $2/(n(n-1))$. 

Proof: Let k be the size of the min-cut set of G. The graph may have several cut-sets 
of minimum size. We compute the probability of finding one specific such set C. 

Since C is a cut-set in the graph, removal of the set C partitions the set of vertices 
into two sets, S and $V - S$, such that there are no edges connecting vertices in S to 
vertices in $V - S$. Assume that, throughout an execution of the algorithm, we contract 
only edges that connect two vertices in S or two vertices in $V - S$, but not edges in C. 
In that case, all the edges eliminated throughout the execution will be edges 
connecting vertices in S or vertices in $V - S$, and after $n - 2$ iterations the algorithm returns a 
graph with two vertices connected by the edges in C. We may therefore conclude that, 
if the algorithm never chooses an edge of C in its $n - 2$ iterations, then the algorithm 
returns C as the minimum cut-set. 

This argument gives some intuition for why we choose the edge at each iteration 
uniformly at random from the remaining existing edges. If the size of the cut C is small 
and if the algorithm chooses the edge uniformly at each step, then the probability that 
the algorithm chooses an edge of C is small - at least when the number of edges 
remaining is large compared to C. 

Let $E_i$ be the event that the edge contracted in iteration i is not in C, and let 
$F_i = \bigcap_{j=1}^{i} E_j$ be the event that no edge of C was contracted in the first i iterations. We need 
to compute $\Pr(F_{n-2})$. 

We start by computing $\Pr(E_1) = \Pr(F_1)$. Since the minimum cut-set has k edges, 
all vertices in the graph must have degree k or larger. If each vertex is adjacent to at 
least k edges, then the graph must have at least nk/2 edges. The first contracted edge 
is chosen uniformly at random from the set of all edges. Since there are at least nk/2 
edges in the graph and since C has k edges, the probability that we do not choose an 
edge of C in the first iteration is given by 


$$\Pr(E_1) = \Pr(F_1) \ge 1 - \frac{2k}{nk} = 1 - \frac{2}{n}.$$

Let us suppose that the first contraction did not eliminate an edge of C. In other 
words, we condition on the event $F_1$. Then, after the first iteration, we are left with an 
$(n - 1)$-node graph with minimum cut-set of size k. Again, the degree of each vertex in 
the graph must be at least k, and the graph must have at least $k(n - 1)/2$ edges. Thus, 

$$\Pr(E_2 \mid F_1) \ge 1 - \frac{k}{k(n-1)/2} = 1 - \frac{2}{n-1}.$$

Similarly, 

$$\Pr(E_i \mid F_{i-1}) \ge 1 - \frac{k}{k(n-i+1)/2} = 1 - \frac{2}{n-i+1}.$$

To compute $\Pr(F_{n-2})$, we use 

$$\begin{aligned}
\Pr(F_{n-2}) &= \Pr(E_{n-2} \cap F_{n-3}) = \Pr(E_{n-2} \mid F_{n-3}) \cdot \Pr(F_{n-3}) \\
&= \Pr(E_{n-2} \mid F_{n-3}) \cdot \Pr(E_{n-3} \mid F_{n-4}) \cdots \Pr(E_2 \mid F_1) \cdot \Pr(F_1) \\
&\ge \prod_{i=1}^{n-2} \left(1 - \frac{2}{n-i+1}\right) = \prod_{i=1}^{n-2} \frac{n-i-1}{n-i+1} \\
&= \left(\frac{n-2}{n}\right)\left(\frac{n-3}{n-1}\right)\left(\frac{n-4}{n-2}\right) \cdots \left(\frac{4}{6}\right)\left(\frac{3}{5}\right)\left(\frac{2}{4}\right)\left(\frac{1}{3}\right) \\
&= \frac{2}{n(n-1)}.
\end{aligned}$$ ■ 

Since the algorithm has a one-sided error, we can reduce the error probability by 
repeating the algorithm. Assume that we run the randomized min-cut algorithm $n(n-1)\ln n$ 
times and output the minimum size cut-set found in all the iterations. The probability 
that the output is not a min-cut set is bounded by 

$$\left(1 - \frac{2}{n(n-1)}\right)^{n(n-1)\ln n} \le e^{-2\ln n} = \frac{1}{n^2}.$$

In the first inequality we have used the fact that $1 - x \le e^{-x}$. 

1.5. Exercises 

Exercise 1.1: We flip a fair coin ten times. Find the probability of the following events. 

(a) The number of heads and the number of tails are equal. 

(b) There are more heads than tails. 

(c) The ith flip and the $(11 - i)$th flip are the same for i = 1, ..., 5. 

(d) We flip at least four consecutive heads. 

Exercise 1.2: We roll two standard six-sided dice. Find the probability of the 
following events, assuming that the outcomes of the rolls are independent. 

(a) The two dice show the same number. 

(b) The number that appears on the first die is larger than the number on the second. 

(c) The sum of the dice is even. 

(d) The product of the dice is a perfect square. 

Exercise 1.3: We shuffle a standard deck of cards, obtaining a permutation that is 
uniform over all 52! possible permutations. Find the probability of the following events. 

(a) The first two cards include at least one ace. 

(b) The first five cards include at least one ace. 

(c) The first two cards are a pair of the same rank. 

(d) The first five cards are all diamonds. 

(e) The first five cards form a full house (three of one rank and two of another rank). 


Exercise 1.4: We are playing a tournament in which we stop as soon as one of us wins 
n games. We are evenly matched, so each of us wins any game with probability 1/2, 
independently of other games. What is the probability that the loser has won k games 
when the match is over? 

Exercise 1.5: After lunch one day, Alice suggests to Bob the following method to 
determine who pays. Alice pulls three six-sided dice from her pocket. These dice are not 
the standard dice, but have the following numbers on their faces: 

• die A - 1, 1, 6, 6, 8, 8; 

• die B - 2, 2, 4, 4, 9, 9; 

• die C - 3, 3, 5, 5, 7, 7. 

The dice are fair, so each side comes up with equal probability. Alice explains that 
Alice and Bob will each pick up one of the dice. They will each roll their die, and the 
one who rolls the lowest number loses and will buy lunch. So as to take no advantage, 
Alice offers Bob the first choice of the dice. 


(a) Suppose that Bob chooses die A and Alice chooses die B. Write out all of the 
possible events and their probabilities, and show that the probability that Alice wins 
is greater than 1/2. 

(b) Suppose that Bob chooses die B and Alice chooses die C. Write out all of the 
possible events and their probabilities, and show that the probability that Alice wins 
is greater than 1/2. 

(c) Since die A and die B lead to situations in Alice’s favor, it would seem that Bob 
should choose die C. Suppose that Bob does choose die C and Alice chooses die A. 
Write out all of the possible events and their probabilities, and show that the 
probability that Alice wins is still greater than 1/2. 


Exercise 1.6: Consider the following balls-and-bin game. We start with one black ball 
and one white ball in a bin. We repeatedly do the following: choose one ball from the 
bin uniformly at random, and then put the ball back in the bin with another ball of the 
same color. We repeat until there are n balls in the bin. Show that the number of white 
balls is equally likely to be any number between 1 and $n - 1$. 


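Exercise 1.7: (a) Prove Lemma 1.3, the inclusion-exclusion principle. 

(b) Prove that, when $\ell$ is odd, 

$$\Pr\Bigl(\bigcup_{i=1}^{n} E_i\Bigr) \le \sum_{i=1}^{n} \Pr(E_i) - \sum_{i<j} \Pr(E_i \cap E_j) + \sum_{i<j<k} \Pr(E_i \cap E_j \cap E_k) - \cdots + (-1)^{\ell+1} \sum_{i_1 < i_2 < \cdots < i_\ell} \Pr(E_{i_1} \cap \cdots \cap E_{i_\ell}).$$

(c) Prove that, when $\ell$ is even, 

$$\Pr\Bigl(\bigcup_{i=1}^{n} E_i\Bigr) \ge \sum_{i=1}^{n} \Pr(E_i) - \sum_{i<j} \Pr(E_i \cap E_j) + \sum_{i<j<k} \Pr(E_i \cap E_j \cap E_k) - \cdots + (-1)^{\ell+1} \sum_{i_1 < i_2 < \cdots < i_\ell} \Pr(E_{i_1} \cap \cdots \cap E_{i_\ell}).$$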


Exercise 1.8: I choose a number uniformly at random from the range [1, 1,000,000]. 
Using the inclusion-exclusion principle, determine the probability that the number 
chosen is divisible by one or more of 4, 6, and 9. 

Exercise 1.9: Suppose that a fair coin is flipped n times. For k > 0, find an upper 
bound on the probability that there is a sequence of $\log_2 n + k$ consecutive heads. 


Exercise 1.10: I have a fair coin and a two-headed coin. I choose one of the two coins 
randomly with equal probability and flip it. Given that the flip was heads, what is the 
probability that I flipped the two-headed coin? 

Exercise 1.11: I am trying to send you a single bit, either a 0 or a 1. When I transmit 
the bit, it goes through a series of n relays before it arrives to you. Each relay flips the 
bit independently with probability p. 


(a) Argue that the probability you receive the correct bit is 

$$\sum_{k=0}^{\lfloor n/2 \rfloor} \binom{n}{2k} p^{2k} (1-p)^{n-2k}.$$

(b) We consider an alternative way to calculate this probability. Let us say the relay 
has bias q if the probability it flips the bit is $(1 - q)/2$. The bias q is therefore a 
real number in the range $[-1, 1]$. Prove that sending a bit through two relays with 
bias $q_1$ and $q_2$ is equivalent to sending a bit through a single relay with bias $q_1 q_2$. 

(c) Prove that the probability you receive the correct bit when it passes through n 
relays as described before (a) is 

$$\frac{1 + (1 - 2p)^n}{2}.$$

Exercise 1.12: The following problem is known as the Monty Hall problem, after the 
host of the game show “Let’s Make a Deal”. There are three curtains. Behind one 
curtain is a new car, and behind the other two are goats. The game is played as 
follows. The contestant chooses the curtain that she thinks the car is behind. Monty then 
opens one of the other curtains to show a goat. (Monty may have more than one goat 
to choose from; in this case, assume he chooses which goat to show uniformly at 
random.) The contestant can then stay with the curtain she originally chose or switch to 
the other unopened curtain. After that, the location of the car is revealed, and the 
contestant wins the car or the remaining goat. Should the contestant switch curtains or not, 
or does it make no difference? 


Exercise 1.13: A medical company touts its new test for a certain genetic disorder. 
The false negative rate is small: if you have the disorder, the probability that the test 
returns a positive result is 0.999. The false positive rate is also small: if you do not 
have the disorder, the probability that the test returns a positive result is only 0.005. 
Assume that 2% of the population has the disorder. If a person chosen uniformly from 
the population is tested and the result comes back positive, what is the probability that 
the person has the disorder? 


Exercise 1.14: I am playing in a racquetball tournament, and I am up against a player 
I have watched but never played before. I consider three possibilities for my prior 
model: we are equally talented, and each of us is equally likely to win each game; I 
am slightly better, and therefore I win each game independently with probability 0.6; 
or he is slightly better, and thus he wins each game independently with probability 0.6. 
Before we play, I think that each of these three possibilities is equally likely. 

In our match we play until one player wins three games. I win the second game, but 
he wins the first, third, and fourth. After this match, in my posterior model, with what 
probability should I believe that my opponent is slightly better than I am? 


Exercise 1.15: Suppose that we roll ten standard six-sided dice. What is the 
probability that their sum will be divisible by 6, assuming that the rolls are independent? (Hint: 
Use the principle of deferred decisions, and consider the situation after rolling all but 
one of the dice.) 

Exercise 1.16: Consider the following game, played with three standard six-sided dice. 
If the player ends with all three dice showing the same number, she wins. The player 
starts by rolling all three dice. After this first roll, the player can select any one, two, 
or all of the three dice and re-roll them. After this second roll, the player can again 
select any of the three dice and re-roll them one final time. For questions (a)-(d), assume 
that the player uses the following optimal strategy: if all three dice match, the player 
stops and wins; if two dice match, the player re-rolls the die that does not match; and 
if no dice match, the player re-rolls them all. 


(a) Find the probability that all three dice show the same number on the first roll. 

(b) Find the probability that exactly two of the three dice show the same number on 
the first roll. 

(c) Find the probability that the player wins, conditioned on exactly two of the three 
dice showing the same number on the first roll. 

(d) By considering all possible sequences of rolls, find the probability that the player 
wins the game. 

Exercise 1.17: In our matrix multiplication algorithm, we worked over the integers 
modulo 2. Explain how the analysis would change if we worked over the integers 
modulo k for k > 2. 


Exercise 1.18: We have a function $F\colon \{0, \ldots, n-1\} \to \{0, \ldots, m-1\}$. We know that, 
for $0 \le x, y \le n-1$, $F((x + y) \bmod n) = (F(x) + F(y)) \bmod m$. The only way we 
have for evaluating F is to use a lookup table that stores the values of F. Unfortunately, 
an Evil Adversary has changed the value of 1/5 of the table entries when we were not 
looking. 

Describe a simple randomized algorithm that, given an input z, outputs a value that 
equals F(z) with probability at least 1/2. Your algorithm should work for every value 
of z, regardless of what values the Adversary changed. Your algorithm should use as 
few lookups and as little computation as possible. 

Suppose I allow you to repeat your initial algorithm three times. What should you 
do in this case, and what is the probability that your enhanced algorithm returns the 
correct answer? 


Exercise 1.19: Give examples of events where Pr(A | B) < Pr(A), Pr(A | B) = 
Pr(A), and Pr(A | B) > Pr(A). 

Exercise 1.20: Show that, if $E_1, E_2, \ldots, E_n$ are mutually independent, then so are 
$\bar{E}_1, \bar{E}_2, \ldots, \bar{E}_n$. 


Exercise 1.21: Give an example of three random events X, Y, Z for which any pair are 
independent but all three are not mutually independent. 


Exercise 1.22: (a) Consider the set {1, ..., n}. We generate a subset X of this set as 
follows: a fair coin is flipped independently for each element of the set; if the coin lands 
heads then the element is added to X, and otherwise it is not. Argue that the resulting 
set X is equally likely to be any one of the $2^n$ possible subsets. 

(b) Suppose that two sets X and Y are chosen independently and uniformly at 
random from all the $2^n$ subsets of {1, ..., n}. Determine $\Pr(X \subseteq Y)$ and 
$\Pr(X \cup Y = \{1, \ldots, n\})$. (Hint: Use part (a) of this problem.) 

Exercise 1.23: There may be several different min-cut sets in a graph. Using the 
analysis of the randomized min-cut algorithm, argue that there can be at most $n(n-1)/2$ 
distinct min-cut sets. 

Exercise 1.24: Generalizing on the notion of a cut-set, we define an r-way cut-set in a 
graph as a set of edges whose removal breaks the graph into r or more connected 
components. Explain how the randomized min-cut algorithm can be used to find minimum 
r-way cut-sets, and bound the probability that it succeeds in one iteration. 


Exercise 1.25: To improve the probability of success of the randomized min-cut 
algorithm, it can be run multiple times. 

(a) Consider running the algorithm twice. Determine the number of edge contractions 
and bound the probability of finding a min-cut. 

(b) Consider the following variation. Starting with a graph with n vertices, first 
contract the graph down to k vertices using the randomized min-cut algorithm. Make 
copies of the graph with k vertices, and now run the randomized algorithm on this 
reduced graph $\ell$ times, independently. Determine the number of edge contractions 
and bound the probability of finding a minimum cut. 

(c) Find optimal (or at least near-optimal) values of k and $\ell$ for the variation in (b) that 
maximize the probability of finding a minimum cut while using the same number 
of edge contractions as running the original algorithm twice. 

Exercise 1.26: Tic-tac-toe always ends up in a tie if players play optimally. Instead, 
we may consider random variations of tic-tac-toe. 

(a) First variation: Each of the nine squares is labeled either X or O according to an 
independent and uniform coin flip. If only one of the players has one (or more) 
winning tic-tac-toe combinations, that player wins. Otherwise, the game is a tie. 
Determine the probability that X wins. (You may want to use a computer program 
to help run through the configurations.) 

(b) Second variation: X and O take turns, with the X player going first. On the X 
player’s turn, an X is placed on a square chosen independently and uniformly at 
random from the squares that are still vacant; O plays similarly. The first player 
to have a winning tic-tac-toe combination wins the game, and a tie occurs if 
neither player achieves a winning combination. Find the probability that each player 
wins. (Again, you may want to write a program to help you.) 

CHAPTER TWO 

Discrete Random Variables 
and Expectation 


In this chapter, we introduce the concepts of discrete random variables and expectation 
and then develop basic techniques for analyzing the expected performance of 
algorithms. We apply these techniques to computing the expected running time of the 
well-known Quicksort algorithm. In analyzing two versions of Quicksort, we demonstrate 
the distinction between the analysis of randomized algorithms, where the probability 
space is defined by the random choices made by the algorithm, and the probabilistic 
analysis of deterministic algorithms, where the probability space is defined by some 
probability distribution on the inputs. 

Along the way we define the Bernoulli, binomial, and geometric random variables, 
study the expected size of a simple branching process, and analyze the expectation of 
the coupon collector’s problem - a probabilistic paradigm that reappears throughout 
the book. 

2.1. Random Variables and Expectation 

When studying a random event, we are often interested in some value associated with 
the random event rather than in the event itself. For example, in tossing two dice we 
are often interested in the sum of the two dice rather than the separate value of each die. 
The sample space in tossing two dice consists of 36 events of equal probability, given 
by the ordered pairs of numbers {(1,1), (1,2), ..., (6,5), (6,6)}. If the quantity we are 
interested in is the sum of the two dice, then we are interested in 11 events (of unequal 
probability): the 11 possible outcomes of the sum. Any such function from the sample 
space to the real numbers is called a random variable. 

Definition 2.1: A random variable X on a sample space $\Omega$ is a real-valued function 
on $\Omega$; that is, $X\colon \Omega \to \mathbb{R}$. A discrete random variable is a random variable that takes 
on only a finite or countably infinite number of values. 


Since random variables are functions, they are usually denoted by a capital letter such 
as X or Y, while real numbers are usually denoted by lowercase letters. 

For a discrete random variable X and a real value a, the event “X = a” includes all 
the basic events of the sample space in which the random variable X assumes the value 
a. That is, “X = a” represents the set $\{s \in \Omega \mid X(s) = a\}$. We denote the probability 
of that event by 

$$\Pr(X = a) = \sum_{s \in \Omega : X(s) = a} \Pr(s).$$

If X is the random variable representing the sum of the two dice, then the event X = 4 
corresponds to the set of basic events {(1,3), (2,2), (3,1)}. Hence 

$$\Pr(X = 4) = \frac{3}{36} = \frac{1}{12}.$$


The definition of independence that we developed for events extends to random 
variables. 


Definition 2.2: Two random variables X and Y are independent if and only if 

$$\Pr((X = x) \cap (Y = y)) = \Pr(X = x) \cdot \Pr(Y = y)$$

for all values x and y. Similarly, random variables $X_1, X_2, \ldots, X_k$ are mutually 
independent if and only if, for any subset $I \subseteq [1, k]$ and any values $x_i$, $i \in I$, 

$$\Pr\Bigl(\bigcap_{i \in I} (X_i = x_i)\Bigr) = \prod_{i \in I} \Pr(X_i = x_i).$$


A basic characteristic of a random variable is its expectation. The expectation of a 
random variable is a weighted average of the values it assumes, where each value is 
weighted by the probability that the variable assumes that value. 

Definition 2.3: The expectation of a discrete random variable X, denoted by $\mathbf{E}[X]$, is 
given by 

$$\mathbf{E}[X] = \sum_{i} i \Pr(X = i),$$

where the summation is over all values in the range of X. The expectation is finite if 
$\sum_{i} |i| \Pr(X = i)$ converges; otherwise, the expectation is unbounded. 


For example, the expectation of the random variable X representing the sum of two 
dice is 

$$\mathbf{E}[X] = \frac{1}{36} \cdot 2 + \frac{2}{36} \cdot 3 + \frac{3}{36} \cdot 4 + \cdots + \frac{1}{36} \cdot 12 = 7.$$

You may try using symmetry to give a simpler argument for why E[X] = 7. 
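
The probabilities and the expectation for the sum of two dice can be computed directly from the definitions by enumerating the 36 equally likely outcomes. The following is a small sketch with exact fractions; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def expectation(pmf):
    """E[X] = sum over values i of i * Pr(X = i)."""
    return sum(value * prob for value, prob in pmf.items())

def pmf_of_sum_of_two_dice():
    """Pr(X = a) for X the sum of two independent fair dice."""
    pmf = {}
    for a, b in product(range(1, 7), repeat=2):
        # Each ordered pair (a, b) is a basic event of probability 1/36.
        pmf[a + b] = pmf.get(a + b, 0) + Fraction(1, 36)
    return pmf

pmf = pmf_of_sum_of_two_dice()
print(pmf[4])            # 1/12
print(expectation(pmf))  # 7
```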

As an example of where the expectation of a discrete random variable is unbounded, 
consider a random variable X that takes on the value $2^i$ with probability $1/2^i$ for 
i = 1, 2, .... The expected value of X is 

$$\mathbf{E}[X] = \sum_{i=1}^{\infty} \frac{1}{2^i} 2^i = \sum_{i=1}^{\infty} 1 = \infty.$$

Here we use the somewhat informal notation $\mathbf{E}[X] = \infty$ to express that E[X] is 
unbounded. 

2.1.1. Linearity of Expectations 

A key property of expectation that significantly simplifies its computation is the 
linearity of expectations. By this property, the expectation of the sum of random variables is 
equal to the sum of their expectations. Formally, we have the following theorem. 

Theorem 2.1 [Linearity of Expectations]: For any finite collection of discrete 
random variables $X_1, X_2, \ldots, X_n$ with finite expectations, 

$$\mathbf{E}\Bigl[\sum_{i=1}^{n} X_i\Bigr] = \sum_{i=1}^{n} \mathbf{E}[X_i].$$

Proof: We prove the statement for two random variables X and Y; the general case 
follows by induction. The summations that follow are understood to be over the ranges 
of the corresponding random variables: 

$$\begin{aligned}
\mathbf{E}[X + Y] &= \sum_{i} \sum_{j} (i + j) \Pr((X = i) \cap (Y = j)) \\
&= \sum_{i} \sum_{j} i \Pr((X = i) \cap (Y = j)) + \sum_{i} \sum_{j} j \Pr((X = i) \cap (Y = j)) \\
&= \sum_{i} i \sum_{j} \Pr((X = i) \cap (Y = j)) + \sum_{j} j \sum_{i} \Pr((X = i) \cap (Y = j)) \\
&= \sum_{i} i \Pr(X = i) + \sum_{j} j \Pr(Y = j) \\
&= \mathbf{E}[X] + \mathbf{E}[Y].
\end{aligned}$$

The first equality follows from Definition 1.2. In the penultimate equation we have used 
Theorem 1.6, the law of total probability. ■ 

We now use this property to compute the expected sum of two standard dice. Let 
$X = X_1 + X_2$, where $X_i$ represents the outcome of die i for i = 1, 2. Then 

$$\mathbf{E}[X_i] = \frac{1}{6} \sum_{j=1}^{6} j = \frac{7}{2}.$$

Applying the linearity of expectations, we have 

$$\mathbf{E}[X] = \mathbf{E}[X_1] + \mathbf{E}[X_2] = 7.$$

It is worth emphasizing that linearity of expectations holds for any collection of 
random variables, even if they are not independent! For example, consider again the 
previous example and let the random variable $Y = X_1 + X_1^2$. We have 

$$\mathbf{E}[Y] = \mathbf{E}[X_1 + X_1^2] = \mathbf{E}[X_1] + \mathbf{E}[X_1^2],$$

even though $X_1$ and $X_1^2$ are clearly dependent. As an exercise, you may verify this 
identity by considering the six possible outcomes for $X_1$. 
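
The identity for the dependent pair can be checked by enumerating the six outcomes of a single fair die. This is a small sketch with exact fractions; the variable names are chosen here for illustration only.

```python
from fractions import Fraction

outcomes = range(1, 7)   # faces of one fair die
p = Fraction(1, 6)       # probability of each face

e_x1 = sum(p * x for x in outcomes)            # E[X_1]
e_x1_sq = sum(p * x * x for x in outcomes)     # E[X_1^2]
e_y = sum(p * (x + x * x) for x in outcomes)   # E[X_1 + X_1^2], computed directly

print(e_x1, e_x1_sq, e_y)  # 7/2 91/6 56/3
```

The direct value 56/3 equals 7/2 + 91/6, even though the two summands are functions of the same die.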

Linearity of expectations also holds for countably infinite summations in certain 
cases. Specifically, it can be shown that 

$$\mathbf{E}\Bigl[\sum_{i=1}^{\infty} X_i\Bigr] = \sum_{i=1}^{\infty} \mathbf{E}[X_i]$$

whenever $\sum_{i=1}^{\infty} \mathbf{E}[|X_i|]$ converges. The issue of dealing with the linearity of 
expectations with countably infinite summations is further considered in Exercise 2.29. 

This chapter contains several examples in which the linearity of expectations 
significantly simplifies the computation of expectations. One result related to the linearity 
of expectations is the following simple lemma. 

Lemma 2.2: For any constant c and discrete random variable X, 

$$\mathbf{E}[cX] = c\,\mathbf{E}[X].$$

Proof: The lemma is obvious for c = 0. For $c \neq 0$, 

$$\begin{aligned}
\mathbf{E}[cX] &= \sum_{j} j \Pr(cX = j) \\
&= c \sum_{j} (j/c) \Pr(X = j/c) \\
&= c \sum_{k} k \Pr(X = k) \\
&= c\,\mathbf{E}[X].
\end{aligned}$$ ■ 

2.1.2. Jensen’s Inequality 


Suppose that we choose the length X of a side of a square uniformly at random
from the range [1,99]. What is the expected value of the area? We can write this
as E[X^2]. It is tempting to think of this as being equal to E[X]^2, but a simple
calculation shows that this is not correct. In fact, E[X]^2 = 2500 whereas E[X^2] = 9950/3 >
2500.
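
The two values quoted above are easy to recompute. This sketch assumes that X takes each integer value in [1,99] with equal probability, which is the reading that reproduces the stated numbers; the function name `moments_uniform_int` is illustrative.

```python
from fractions import Fraction

def moments_uniform_int(lo, hi):
    """Return (E[X], E[X^2]) for X uniform over the integers lo..hi."""
    count = hi - lo + 1
    mean = Fraction(sum(range(lo, hi + 1)), count)
    second_moment = Fraction(sum(x * x for x in range(lo, hi + 1)), count)
    return mean, second_moment

mean, second_moment = moments_uniform_int(1, 99)
print(mean ** 2)               # E[X]^2
print(second_moment)           # E[X^2]
print(second_moment - mean ** 2)  # the gap between the two
```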

More generally, we can prove that E[X^2] ≥ (E[X])^2. Consider Y = (X - E[X])^2.
The random variable Y is nonnegative and hence its expectation must also be
nonnegative. Therefore,




\begin{align*}
0 \le E[Y] &= E[(X - E[X])^2] \\
&= E[X^2 - 2XE[X] + (E[X])^2] \\
&= E[X^2] - 2E[XE[X]] + (E[X])^2 \\
&= E[X^2] - (E[X])^2.
\end{align*}

To obtain the penultimate line, we used the linearity of expectations. To obtain the last
line we used Lemma 2.2 to simplify E[XE[X]] = E[X] · E[X].

The fact that E[X^2] ≥ (E[X])^2 is an example of a more general theorem known
as Jensen's inequality. Jensen's inequality shows that, for any convex function f, we
have E[f(X)] ≥ f(E[X]).

Definition 2.4: A function f: R → R is said to be convex if, for any x_1, x_2 and
0 ≤ λ ≤ 1,

\[ f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2). \]

Visually, a convex function f has the property that, if you connect two points on the
graph of the function by a straight line, this line lies on or above the graph of the
function. The following fact, which we state without proof, is often a useful alternative to
Definition 2.4.

Lemma 2.3: If f is a twice differentiable function, then f is convex if and only if
f''(x) ≥ 0.

Theorem 2.4 [Jensen's Inequality]: If f is a convex function, then

\[ E[f(X)] \ge f(E[X]). \]

Proof: We prove the theorem assuming that f has a Taylor expansion. Let μ = E[X].
By Taylor's theorem there is a value c such that

\[ f(x) = f(\mu) + f'(\mu)(x - \mu) + \frac{f''(c)(x - \mu)^2}{2} \ge f(\mu) + f'(\mu)(x - \mu), \]

since f''(c) ≥ 0 by convexity. Taking expectations of both sides and applying linearity
of expectations and Lemma 2.2 yields the result:

\begin{align*}
E[f(X)] &\ge E[f(\mu) + f'(\mu)(X - \mu)] \\
&= E[f(\mu)] + f'(\mu)(E[X] - \mu) \\
&= f(\mu) = f(E[X]). ■
\end{align*}

An alternative proof of Jensen's inequality, which holds for any random variable X that
takes on only finitely many values, is presented in Exercise 2.10.

2.2. The Bernoulli and Binomial Random Variables 


Suppose that we run an experiment that succeeds with probability p and fails with 
probability 1 — p. 

Let Y be a random variable such that

\[ Y = \begin{cases} 1 & \text{if the experiment succeeds,} \\ 0 & \text{otherwise.} \end{cases} \]

The variable Y is called a Bernoulli or an indicator random variable. Note that, for a
Bernoulli random variable,

\[ E[Y] = p \cdot 1 + (1 - p) \cdot 0 = p = \Pr(Y = 1). \]

For example, if we flip a fair coin and consider the outcome "heads" a success, then
the expected value of the corresponding indicator random variable is 1/2.

Consider now a sequence of n independent coin flips. What is the distribution of the
number of heads in the entire sequence? More generally, consider a sequence of n
independent experiments, each of which succeeds with probability p. If we let X represent
the number of successes in the n experiments, then X has a binomial distribution.

Definition 2.5: A binomial random variable X with parameters n and p, denoted by
B(n, p), is defined by the following probability distribution on j = 0, 1, 2, ..., n:

\[ \Pr(X = j) = \binom{n}{j} p^j (1 - p)^{n-j}. \]

That is, the binomial random variable X equals j when there are exactly j successes
and n - j failures in n independent experiments, each of which is successful with
probability p.

As an exercise, you should show that Definition 2.5 ensures that \sum_{j=0}^{n} \Pr(X = j) =
1. This is necessary for the binomial random variable to be a valid probability function,
according to Definition 1.2.

The binomial random variable arises in many contexts, especially in sampling. As a 
practical example, suppose that we want to gather data about the packets going through 
a router by postprocessing them. We might want to know the approximate fraction of 
packets from a certain source or of a certain data type. We do not have the memory 
available to store all of the packets, so we choose to store a random subset - or sam¬ 
ple - of the packets for later analysis. If each packet is stored with probability p and 
if n packets go through the router each day, then the number of sampled packets each 
day is a binomial random variable X with parameters n and p. If we want to know how 
much memory is necessary for such a sample, a natural starting point is to determine 
the expectation of the random variable X. 
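
The packet-sampling scheme can be sketched in a few lines. This is an illustration under stated assumptions, not a router implementation: the names `sample_count` and `binomial_mean_exact` are hypothetical, packets are abstracted away to a simple count, and the exact mean is computed straight from Definition 2.5.

```python
import random
from fractions import Fraction
from math import comb

def binomial_mean_exact(n, p):
    """E[X] for X ~ B(n, p), computed directly from the probability distribution."""
    return sum(j * comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n + 1))

def sample_count(n, p, rng):
    """Number of packets kept when each of n packets is stored independently with probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)

# Exact expectation with rational arithmetic: 10 packets, each kept with probability 1/4.
print(binomial_mean_exact(10, Fraction(1, 4)))

# One simulated day of 1000 packets sampled at rate 0.05 (seeded for repeatability).
print(sample_count(1000, 0.05, random.Random(0)))
```

The exact computation gives a memory estimate in advance, while the simulated count shows the kind of day-to-day value the sampler would actually produce.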

Sampling in this manner arises in other contexts as well. For example, by sampling
the program counter while a program runs, one can determine what parts of a program
are taking the most time. This knowledge can be used to aid dynamic program
optimization techniques such as binary rewriting, where the executable binary form of a
program is modified while the program executes. Since rewriting the executable as the
program runs is expensive, sampling helps the optimizer to determine when it will be
worthwhile.

What is the expectation of a binomial random variable X? We can compute it
directly from the definition as

\begin{align*}
E[X] &= \sum_{j=0}^{n} j \binom{n}{j} p^j (1 - p)^{n-j} \\
&= \sum_{j=1}^{n} \frac{n!}{(j-1)!\,(n-j)!} p^j (1 - p)^{n-j} \\
&= np \sum_{j=1}^{n} \frac{(n-1)!}{(j-1)!\,((n-1)-(j-1))!} p^{j-1} (1 - p)^{(n-1)-(j-1)} \\
&= np \sum_{k=0}^{n-1} \binom{n-1}{k} p^k (1 - p)^{(n-1)-k} \\
&= np,
\end{align*}

where the last equation uses the binomial identity

\[ (x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}. \]

The linearity of expectations allows for a significantly simpler argument. If X is a
binomial random variable with parameters n and p, then X is the number of successes
in n trials, where each trial is successful with probability p. Define a set of n indicator
random variables X_1, ..., X_n, where X_i = 1 if the ith trial is successful and 0 otherwise.
Clearly, E[X_i] = p and X = \sum_{i=1}^{n} X_i and so, by the linearity of expectations,

\[ E[X] = E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i] = np. \]

The linearity of expectations makes this approach of representing a random variable by a
sum of simpler random variables, such as indicator random variables, extremely useful.

2.3. Conditional Expectation

Just as we have defined conditional probability, it is useful to define the conditional
expectation of a random variable. The following definition is quite natural.

Definition 2.6:

\[ E[Y \mid Z = z] = \sum_{y} y \Pr(Y = y \mid Z = z), \]

where the summation is over all y in the range of Y.

The definition states that the conditional expectation of a random variable is, like the 
expectation, a weighted sum of the values it assumes. The difference is that now each 
value is weighted by the conditional probability that the variable assumes that value. 

For example, suppose that we independently roll two standard six-sided dice. Let
X_1 be the number that shows on the first die, X_2 the number on the second die, and X
the sum of the numbers on the two dice. Then

\[ E[X \mid X_1 = 2] = \sum_{x} x \Pr(X = x \mid X_1 = 2) = \sum_{x=3}^{8} x \cdot \frac{1}{6} = \frac{11}{2}. \]

As another example, consider E[X_1 | X = 5]:

\begin{align*}
E[X_1 \mid X = 5] &= \sum_{x=1}^{4} x \Pr(X_1 = x \mid X = 5) \\
&= \sum_{x=1}^{4} x \frac{\Pr(X_1 = x \cap X = 5)}{\Pr(X = 5)} \\
&= \sum_{x=1}^{4} x \frac{1/36}{4/36} = \frac{5}{2}.
\end{align*}
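
Both conditional expectations can be checked by enumerating the 36 equally likely outcomes. The sketch below assumes fair dice; `cond_expectation` is a helper name introduced for this illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (x1, x2) of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def cond_expectation(value, event):
    """E[value | event], where value and event are functions of (x1, x2)."""
    matching = [o for o in outcomes if event(*o)]
    return Fraction(sum(value(*o) for o in matching), len(matching))

# E[X | X_1 = 2], where X = X_1 + X_2.
e_sum_given_first = cond_expectation(lambda a, b: a + b, lambda a, b: a == 2)

# E[X_1 | X = 5].
e_first_given_sum = cond_expectation(lambda a, b: a, lambda a, b: a + b == 5)

print(e_sum_given_first, e_first_given_sum)
```

Because every outcome is equally likely, conditioning reduces to averaging over the outcomes that satisfy the event.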


The following natural identity follows from Definition 2.6. 


Lemma 2.5: For any random variables X and Y,

\[ E[X] = \sum_{y} \Pr(Y = y) E[X \mid Y = y], \]

where the sum is over all values in the range of Y and all of the expectations exist.


Proof:

\begin{align*}
\sum_{y} \Pr(Y = y) E[X \mid Y = y] &= \sum_{y} \Pr(Y = y) \sum_{x} x \Pr(X = x \mid Y = y) \\
&= \sum_{x} \sum_{y} x \Pr(X = x \mid Y = y) \Pr(Y = y) \\
&= \sum_{x} \sum_{y} x \Pr(X = x \cap Y = y) \\
&= \sum_{x} x \Pr(X = x) \\
&= E[X]. ■
\end{align*}



The linearity of expectations also extends to conditional expectations. This is clarified 
in Lemma 2.6, whose proof is left as Exercise 2.11. 


Lemma 2.6: For any finite collection of discrete random variables X_1, X_2, ..., X_n with
finite expectations and for any random variable Y,

\[ E\left[\sum_{i=1}^{n} X_i \,\Big|\, Y = y\right] = \sum_{i=1}^{n} E[X_i \mid Y = y]. \]

Perhaps somewhat confusingly, the conditional expectation is also used to refer to the
following random variable.

Definition 2.7: The expression E[Y | Z] is a random variable f(Z) that takes on the
value E[Y | Z = z] when Z = z.


We emphasize that E[Y | Z] is not a real value; it is actually a function of the
random variable Z. Hence E[Y | Z] is itself a function from the sample space to the real
numbers and can therefore be thought of as a random variable.

In the previous example of rolling two dice,

\[ E[X \mid X_1] = \sum_{x} x \Pr(X = x \mid X_1) = \sum_{x = X_1 + 1}^{X_1 + 6} x \cdot \frac{1}{6} = X_1 + \frac{7}{2}. \]

We see that E[X | X_1] is a random variable whose value depends on X_1.

If E[Y | Z] is a random variable, then it makes sense to consider its expectation
E[E[Y | Z]]. In our example, we found that E[X | X_1] = X_1 + 7/2. Thus

\[ E[E[X \mid X_1]] = E\left[X_1 + \frac{7}{2}\right] = \frac{7}{2} + \frac{7}{2} = 7 = E[X]. \]


More generally, we have the following theorem. 


Theorem 2.7:

\[ E[Y] = E[E[Y \mid Z]]. \]

Proof: From Definition 2.7 we have E[Y | Z] = f(Z), where f(Z) takes on the value
E[Y | Z = z] when Z = z. Hence

\[ E[E[Y \mid Z]] = \sum_{z} E[Y \mid Z = z] \Pr(Z = z). \]

The right-hand side equals E[Y] by Lemma 2.5. ■

We now demonstrate an interesting application of conditional expectations. Consider
a program that includes one call to a process S. Assume that each call to process S
recursively spawns new copies of the process S, where the number of new copies is a
binomial random variable with parameters n and p. We assume that these random
variables are independent for each call to S. What is the expected number of copies of the
process S generated by the program?

To analyze this recursive spawning process, we introduce the idea of generations.
The initial process S is in generation 0. Otherwise, we say that a process S is in
generation i if it was spawned by another process S in generation i - 1. Let Y_i denote the
number of S processes in generation i. Since we know that Y_0 = 1, the number of
processes in generation 1 has a binomial distribution. Thus,

\[ E[Y_1] = np. \]


Similarly, suppose we knew that the number of processes in generation i - 1 was
y_{i-1}, so Y_{i-1} = y_{i-1}. Let Z_k be the number of copies spawned by the kth process
spawned in the (i - 1)th generation for 1 ≤ k ≤ y_{i-1}. Each Z_k is a binomial random
variable with parameters n and p. Then

\begin{align*}
E[Y_i \mid Y_{i-1} = y_{i-1}] &= E\left[\sum_{k=1}^{y_{i-1}} Z_k \,\Big|\, Y_{i-1} = y_{i-1}\right] \\
&= \sum_{j \ge 0} j \Pr\left(\sum_{k=1}^{y_{i-1}} Z_k = j \,\Big|\, Y_{i-1} = y_{i-1}\right) \\
&= \sum_{j \ge 0} j \Pr\left(\sum_{k=1}^{y_{i-1}} Z_k = j\right) \\
&= E\left[\sum_{k=1}^{y_{i-1}} Z_k\right] \\
&= \sum_{k=1}^{y_{i-1}} E[Z_k] \\
&= y_{i-1} np.
\end{align*}

In the third line we have used that the Z_k are all independent binomial random
variables; in particular, the value of each Z_k is independent of Y_{i-1}, allowing us to remove
the conditioning. In the fifth line, we have applied the linearity of expectations.

Applying Theorem 2.7, we can compute the expected size of the ith generation
inductively. We have

\[ E[Y_i] = E[E[Y_i \mid Y_{i-1}]] = E[Y_{i-1} np] = np E[Y_{i-1}]. \]

By induction on i, and using the fact that Y_0 = 1, we then obtain

\[ E[Y_i] = (np)^i. \]

The expected total number of copies of process S generated by the program is
given by

\[ E\left[\sum_{i \ge 0} Y_i\right] = \sum_{i \ge 0} E[Y_i] = \sum_{i \ge 0} (np)^i. \]

If np ≥ 1 then the expectation is unbounded; if np < 1, the expectation is 1/(1 - np).
Thus, the expected number of processes generated by the program is bounded if and
only if the expected number of processes spawned by each process is less than 1.

The process analyzed here is a simple example of a branching process, a probabilis¬ 
tic paradigm extensively studied in probability theory. 
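
The spawning process analyzed in this section can be simulated generation by generation and compared against the closed forms (np)^i and 1/(1 - np). The sketch below is an illustration only: the function names are hypothetical, and the `cap` parameter is an added safeguard (not part of the analysis) that stops a run if it grows very large.

```python
import random

def expected_generation_size(n, p, i):
    """E[Y_i] = (np)^i, the expected number of processes in generation i."""
    return (n * p) ** i

def expected_total(n, p):
    """Expected total number of processes, valid only when n * p < 1."""
    if n * p >= 1:
        raise ValueError("the expectation is unbounded when n * p >= 1")
    return 1 / (1 - n * p)

def simulate_total(n, p, rng, cap=100_000):
    """Simulate one run and return the total number of copies of S, including the first."""
    total = 0
    current = 1  # generation 0 holds the single initial process
    while current > 0 and total < cap:
        total += current
        # Each of the `current` processes spawns a B(n, p) number of copies,
        # so the next generation is a sum of current * n independent trials.
        current = sum(1 for _ in range(current * n) if rng.random() < p)
    return total

rng = random.Random(12345)
runs = [simulate_total(2, 0.25, rng) for _ in range(20000)]
print(sum(runs) / len(runs), expected_total(2, 0.25))
```

With n = 2 and p = 0.25 each process spawns half a copy on average, so the empirical mean should land close to the predicted value of 2.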


2.4. The Geometric Distribution 


Suppose that we flip a coin until it lands on heads. What is the distribution of the 
number of flips? This is an example of a geometric distribution, which arises in the 
following situation: we perform a sequence of independent trials until the first success, 
where each trial succeeds with probability p. 


Definition 2.8: A geometric random variable X with parameter p is given by the
following probability distribution on n = 1, 2, ...:

\[ \Pr(X = n) = (1 - p)^{n-1} p. \]

That is, for the geometric random variable X to equal n there must be n - 1 failures,
followed by a success.

As an exercise, you should show that the geometric random variable satisfies

\[ \sum_{n \ge 1} \Pr(X = n) = 1. \]

Again, this is necessary for the geometric random variable to be a valid probability 
function, according to Definition 1.2. 

In the context of our example from Section 2.2 of sampling packets on a router, if
packets are sampled with probability p, then the number of packets transmitted after
the last sampled packet until and including the next sampled packet is given by a
geometric random variable with parameter p.

Geometric random variables are said to be memoryless because the probability that 
you will reach your first success n trials from now is independent of the number of fail¬ 
ures you have experienced. Informally, one can ignore past failures because they do 
not change the distribution of the number of future trials until first success. Formally, 
we have the following statement. 


Lemma 2.8: For a geometric random variable X with parameter p and for n > 0,

\[ \Pr(X = n + k \mid X > k) = \Pr(X = n). \]

Proof:

\begin{align*}
\Pr(X = n + k \mid X > k) &= \frac{\Pr((X = n + k) \cap (X > k))}{\Pr(X > k)} \\
&= \frac{\Pr(X = n + k)}{\Pr(X > k)} \\
&= \frac{(1 - p)^{n+k-1} p}{\sum_{i=k}^{\infty} (1 - p)^i p} \\
&= \frac{(1 - p)^{n+k-1} p}{(1 - p)^k} \\
&= (1 - p)^{n-1} p \\
&= \Pr(X = n).
\end{align*}

The fourth equality uses the fact that, for 0 < x < 1, \sum_{i=k}^{\infty} x^i = x^k/(1 - x). ■
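
The memoryless property can be confirmed exactly for small values of n and k. This sketch uses rational arithmetic and the closed form Pr(X > k) = (1 - p)^k from the proof; the function names are introduced here for illustration.

```python
from fractions import Fraction

def geom_pmf(n, p):
    """Pr(X = n) for a geometric random variable with parameter p, n = 1, 2, ..."""
    return (1 - p) ** (n - 1) * p

def geom_tail(k, p):
    """Pr(X > k): the first k trials all fail."""
    return (1 - p) ** k

def conditional_pmf(n, k, p):
    """Pr(X = n + k | X > k), computed from the definition of conditional probability."""
    return geom_pmf(n + k, p) / geom_tail(k, p)

p = Fraction(1, 3)
for k in range(4):
    # Each row repeats the unconditional distribution, whatever k is.
    print(k, [conditional_pmf(n, k, p) for n in range(1, 4)])
```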

We now turn to computing the expectation of a geometric random variable. When a
random variable takes values in the set of natural numbers N = {0, 1, 2, 3, ...}, there
is an alternative formula for calculating its expectation.

Lemma 2.9: Let X be a discrete random variable that takes on only nonnegative
integer values. Then

\[ E[X] = \sum_{i=1}^{\infty} \Pr(X \ge i). \]

Proof:

\begin{align*}
\sum_{i=1}^{\infty} \Pr(X \ge i) &= \sum_{i=1}^{\infty} \sum_{j=i}^{\infty} \Pr(X = j) \\
&= \sum_{j=1}^{\infty} \sum_{i=1}^{j} \Pr(X = j) \\
&= \sum_{j=1}^{\infty} j \Pr(X = j) \\
&= E[X].
\end{align*}

The interchange of (possibly) infinite summations is justified, since the terms being
summed are all nonnegative. ■


For a geometric random variable X with parameter p,

\[ \Pr(X \ge i) = \sum_{n=i}^{\infty} (1 - p)^{n-1} p = (1 - p)^{i-1}. \]

Hence

\begin{align*}
E[X] &= \sum_{i=1}^{\infty} \Pr(X \ge i) \\
&= \sum_{i=1}^{\infty} (1 - p)^{i-1} \\
&= \frac{1}{1 - (1 - p)} \\
&= \frac{1}{p}.
\end{align*}

Thus, for a fair coin where p = 1/2, on average it takes two flips to see the first heads.
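
The tail-sum formula of Lemma 2.9 can be compared with the direct definition of expectation. The sketch below checks it exactly on a fair die and numerically on a truncated geometric series; the function names and the choice of 60 terms are assumptions made for this illustration.

```python
from fractions import Fraction

def expectation_direct(pmf):
    """E[X] = sum of j * Pr(X = j), with pmf given as a dict {value: probability}."""
    return sum(value * prob for value, prob in pmf.items())

def expectation_tail_sum(pmf):
    """E[X] = sum over i >= 1 of Pr(X >= i), for nonnegative integer values."""
    largest = max(pmf)
    return sum(
        sum(prob for value, prob in pmf.items() if value >= i)
        for i in range(1, largest + 1)
    )

def geometric_mean_partial(p, terms):
    """Partial sum of Pr(X >= i) = (1 - p)^(i - 1) for i = 1..terms."""
    return sum((1 - p) ** (i - 1) for i in range(1, terms + 1))

die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expectation_direct(die), expectation_tail_sum(die))
print(geometric_mean_partial(0.5, 60))  # approaches 1/p = 2
```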

There is another approach to finding the expectation of a geometric random
variable X with parameter p - one that uses conditional expectations and the
memoryless property of geometric random variables. Recall that X corresponds to the
number of flips until the first heads given that each flip is heads with probability p. Let
Y = 0 if the first flip is tails and Y = 1 if the first flip is heads. By the identity from
Lemma 2.5,

\begin{align*}
E[X] &= \Pr(Y = 0) E[X \mid Y = 0] + \Pr(Y = 1) E[X \mid Y = 1] \\
&= (1 - p) E[X \mid Y = 0] + p E[X \mid Y = 1].
\end{align*}

If Y = 1 then X = 1, so E[X | Y = 1] = 1. If Y = 0, then X > 1. In this case, let
the number of remaining flips (after the first flip until the first heads) be Z. Then, by
the linearity of expectations,

\[ E[X] = (1 - p) E[Z + 1] + p \cdot 1 = (1 - p) E[Z] + 1. \]

By the memoryless property of geometric random variables, Z is also a geometric
random variable with parameter p. Hence E[Z] = E[X], since they both have the same
distribution. We therefore have

\[ E[X] = (1 - p) E[Z] + 1 = (1 - p) E[X] + 1, \]

which yields E[X] = 1/p.

This method of using conditional expectations to compute an expectation is often 
useful, especially in conjunction with the memoryless property of a geometric random 
variable. 


2.4.1. Example: Coupon Collector’s Problem 

The coupon collector's problem arises from the following scenario. Suppose that each
box of cereal contains one of n different coupons. Once you obtain one of every type
of coupon, you can send in for a prize. Assuming that the coupon in each box is chosen
independently and uniformly at random from the n possibilities and that you do not
collaborate with others to collect coupons, how many boxes of cereal must you buy
before you obtain at least one of every type of coupon? This simple problem arises in
many different scenarios and will reappear in several places in the book.

Let X be the number of boxes bought until at least one of every type of coupon is
obtained. We now determine E[X]. If X_i is the number of boxes bought while you had
exactly i - 1 different coupons, then clearly X = \sum_{i=1}^{n} X_i.

The advantage of breaking the random variable X into a sum of n random variables
X_i, i = 1, ..., n, is that each X_i is a geometric random variable. When exactly i - 1
coupons have been found, the probability of obtaining a new coupon is

\[ p_i = 1 - \frac{i - 1}{n}. \]

Hence, X_i is a geometric random variable with parameter p_i, and

\[ E[X_i] = \frac{1}{p_i} = \frac{n}{n - i + 1}. \]

Using the linearity of expectations, we have that

\begin{align*}
E[X] &= E\left[\sum_{i=1}^{n} X_i\right] \\
&= \sum_{i=1}^{n} E[X_i] \\
&= \sum_{i=1}^{n} \frac{n}{n - i + 1} \\
&= n \sum_{i=1}^{n} \frac{1}{i}.
\end{align*}

The summation \sum_{i=1}^{n} 1/i is known as the harmonic number H(n), and as we show
next, H(n) = ln n + Θ(1). Thus, for the coupon collector's problem, the expected
number of random coupons required to obtain all n coupons is n ln n + Θ(n).
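
The expectation nH(n) derived above can be computed exactly and compared with a direct simulation of buying boxes. This is a sketch under the stated model (each box holds a uniformly random coupon); `harmonic`, `expected_boxes`, and `simulate_boxes` are illustrative names.

```python
import random
from fractions import Fraction

def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n as an exact fraction."""
    return sum(Fraction(1, i) for i in range(1, n + 1))

def expected_boxes(n):
    """E[X] = n * H(n): expected boxes needed to collect all n coupon types."""
    return n * harmonic(n)

def simulate_boxes(n, rng):
    """Buy boxes until every one of the n coupon types has been seen; return the count."""
    seen = set()
    boxes = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # coupon type chosen uniformly at random
        boxes += 1
    return boxes

rng = random.Random(1)
trials = [simulate_boxes(10, rng) for _ in range(5000)]
print(float(expected_boxes(10)), sum(trials) / len(trials))
```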

Lemma 2.10: The harmonic number H(n) = \sum_{i=1}^{n} 1/i satisfies H(n) = ln n + Θ(1).


Proof: Since 1/x is monotonically decreasing, we can write

\[ \ln n = \int_{x=1}^{n} \frac{1}{x}\,dx \le \sum_{k=1}^{n} \frac{1}{k} \]

and

\[ \sum_{k=2}^{n} \frac{1}{k} \le \int_{x=1}^{n} \frac{1}{x}\,dx = \ln n. \]

(a) Approximating 1/x from above. (b) Approximating 1/x from below.

Figure 2.1: Approximating the area below f(x) = 1/x.

This is clarified in Figure 2.1, where the area below the curve f(x) = 1/x
corresponds to the integral and the areas of the shaded regions correspond to the summations
\sum_{k=1}^{n} 1/k and \sum_{k=2}^{n} 1/k.

Hence ln n ≤ H(n) ≤ ln n + 1, proving the claim. ■

As a simple application of the coupon collector's problem, suppose that packets are 
sent in a stream from a source host to a destination host along a fixed path of routers. 
The host at the destination would like to know which routers the stream of packets has 
passed through, in case it finds later that some router damaged packets that it processed. 
If there is enough room in the packet header, each router can append its identification 
number to the header, giving the path. Unfortunately, there may not be that much room 
available in the packet header. 

Suppose instead that each packet header has space for exactly one router
identification number, and this space is used to store the identification of a router chosen
uniformly at random from all of the routers on the path. This can actually be
accomplished easily; we consider how in Exercise 2.18. Then, from the point of view of the
destination host, determining all the routers on the path is like a coupon collector's
problem. If there are n routers along the path, then the expected number of packets in
the stream that must arrive before the destination host knows all of the routers on the
path is nH(n) = n ln n + Θ(n).


2.5. Application: The Expected Run-Time of Quicksort 

Quicksort is a simple - and, in practice, very efficient - sorting algorithm. The input
is a list of n numbers x_1, x_2, ..., x_n. For convenience, we will assume that the
numbers are distinct. A call to the Quicksort function begins by choosing a pivot element
from the set. Let us assume the pivot is x. The algorithm proceeds by comparing every
other element to x, dividing the list of elements into two sublists: those that are less
than x and those that are greater than x. Notice that if the comparisons are performed
in the natural order, from left to right, then the order of the elements in each sublist is
the same as in the initial list. Quicksort then recursively sorts these sublists.

Quicksort Algorithm:

Input: A list S = {x_1, ..., x_n} of n distinct elements over a totally ordered
universe.

Output: The elements of S in sorted order.

1. If S has one or zero elements, return S. Otherwise continue.

2. Choose an element of S as a pivot; call it x.

3. Compare every other element of S to x in order to divide the other elements
into two sublists:

(a) S_1 has all the elements of S that are less than x;

(b) S_2 has all those that are greater than x.

4. Use Quicksort to sort S_1 and S_2.

5. Return the list S_1, x, S_2.


Algorithm 2.1: Quicksort.
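
Algorithm 2.1 translates almost line for line into code. The following is a minimal sketch that also counts comparisons; the function names and the convention of passing the pivot rule as a parameter are choices made for this illustration, not part of the algorithm as stated.

```python
import random

def quicksort(items, choose_pivot):
    """Sort a list of distinct items; return (sorted_list, number_of_comparisons)."""
    if len(items) <= 1:
        return list(items), 0
    pivot_index = choose_pivot(items)
    pivot = items[pivot_index]
    smaller, larger = [], []
    comparisons = 0
    # Compare every other element to the pivot, left to right, so each
    # sublist keeps the relative order of the original list.
    for index, value in enumerate(items):
        if index == pivot_index:
            continue
        comparisons += 1
        if value < pivot:
            smaller.append(value)
        else:
            larger.append(value)
    sorted_smaller, left_count = quicksort(smaller, choose_pivot)
    sorted_larger, right_count = quicksort(larger, choose_pivot)
    return sorted_smaller + [pivot] + sorted_larger, comparisons + left_count + right_count

def first_element(items):
    """Deterministic rule: always use the first element of the (sub)list as pivot."""
    return 0

def make_uniform_pivot(rng):
    """Randomized rule: pick the pivot position uniformly at random."""
    return lambda items: rng.randrange(len(items))

decreasing = list(range(10, 0, -1))
print(quicksort(decreasing, first_element))
print(quicksort(decreasing, make_uniform_pivot(random.Random(7))))
```

Running both pivot rules on the same decreasing input shows how the choice of pivot changes the comparison count while the sorted output stays the same.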

In the worst case, Quicksort requires Ω(n^2) comparison operations. For example,
suppose our input has the form x_1 = n, x_2 = n - 1, ..., x_{n-1} = 2, x_n = 1. Suppose
also that we adopt the rule that the pivot should be the first element of the list. The
first pivot chosen is then n, so Quicksort performs n - 1 comparisons. The division has
yielded one sublist of size 0 (which requires no additional work) and another of size
n - 1, with the order n - 1, n - 2, ..., 2, 1. The next pivot chosen is n - 1, so
Quicksort performs n - 2 comparisons and is left with one group of size n - 2 in the order
n - 2, n - 3, ..., 2, 1. Continuing in this fashion, Quicksort performs

\[ (n - 1) + (n - 2) + \cdots + 2 + 1 = \frac{n(n - 1)}{2} \text{ comparisons.} \]

This is not the only bad case that leads to Ω(n^2) comparisons; similarly poor
performance occurs if the pivot element is chosen from among the smallest few or the largest
few elements each time.

We clearly made a bad choice of pivots for the given input. A reasonable choice of
pivots would require many fewer comparisons. For example, if our pivot always split
the list into two sublists of size at most ⌈n/2⌉, then the number of comparisons C(n)
would obey the following recurrence relation:

\[ C(n) \le 2C(\lceil n/2 \rceil) + \Theta(n). \]

The solution to this equation yields C(n) = O(n log n), which is the best possible
result for comparison-based sorting. In fact, any sequence of pivot elements that always
split the input list into two sublists each of size at least cn for some constant c would
yield an O(n log n) running time.

This discussion provides some intuition for how we would like pivots to be chosen.
In each iteration of the algorithm there is a good set of pivot elements that split the
input list into two almost equal sublists; it suffices if the sizes of the two sublists are
within a constant factor of each other. There is also a bad set of pivot elements that do
not split up the list significantly. If good pivots are chosen sufficiently often, Quicksort
will terminate quickly. How can we guarantee that the algorithm chooses good pivot
elements sufficiently often? We can resolve this problem in one of two ways.

First, we can change the algorithm to choose the pivots randomly. This makes
Quicksort a randomized algorithm; the randomization makes it extremely unlikely that we
repeatedly choose the wrong pivots. We demonstrate shortly that the expected number
of comparisons made by a simple randomized Quicksort is 2n ln n + O(n), matching
(up to constant factors) the Ω(n log n) bound for comparison-based sorting. Here, the
expectation is over the random choice of pivots.

A second possibility is that we can keep our deterministic algorithm, using the first
list element as a pivot, but consider a probabilistic model of the inputs. A
permutation of a set of n distinct items is just one of the n! orderings of these items. Instead of
looking for the worst possible input, we assume that the input items are given to us in
a random order. This may be a reasonable assumption for some applications;
alternatively, this could be accomplished by ordering the input list according to a randomly
chosen permutation before running the deterministic Quicksort algorithm. In this case,
we have a deterministic algorithm but a probabilistic analysis based on a model of the
inputs. We again show in this setting that the expected number of comparisons made
is 2n ln n + O(n). Here, the expectation is over the random choice of inputs.

The same techniques are generally used both in analyses of randomized algorithms 
and in probabilistic analyses of deterministic algorithms. Indeed, in this application the 
analysis of the randomized Quicksort and the probabilistic analysis of the deterministic 
Quicksort under random inputs are essentially the same. 

Let us first analyze Random Quicksort, the randomized algorithm version of Quick¬ 
sort. 

Theorem 2.11: Suppose that, whenever a pivot is chosen for Random Quicksort, it is
chosen independently and uniformly at random from all possibilities. Then, for any
input, the expected number of comparisons made by Random Quicksort is 2n ln n + O(n).

Proof: Let y_1, y_2, ..., y_n be the same values as the input values x_1, x_2, ..., x_n but sorted
in increasing order. For i < j, let X_{ij} be a random variable that takes on the value 1 if
y_i and y_j are compared at any time over the course of the algorithm, and 0 otherwise.
Then the total number of comparisons X satisfies

\[ X = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X_{ij}, \]

and

\begin{align*}
E[X] &= E\left[\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X_{ij}\right] \\
&= \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} E[X_{ij}]
\end{align*}

by the linearity of expectations.

Since X_{ij} is an indicator random variable that takes on only the values 0 and 1, E[X_{ij}]
is equal to the probability that X_{ij} is 1. Hence all we need to do is compute the
probability that two elements y_i and y_j are compared. Now, y_i and y_j are compared if and
only if either y_i or y_j is the first pivot selected by Random Quicksort from the set Y^{ij} =
{y_i, y_{i+1}, ..., y_{j-1}, y_j}. This is because if y_i (or y_j) is the first pivot selected from this
set, then y_i and y_j must still be in the same sublist, and hence they will be compared.
Similarly, if neither is the first pivot from this set, then y_i and y_j will be separated into
distinct sublists and so will not be compared.

Since our pivots are chosen independently and uniformly at random from each
sublist, it follows that, the first time a pivot is chosen from Y^{ij}, it is equally likely to be any
element from this set. Thus the probability that y_i or y_j is the first pivot selected from
Y^{ij}, which is the probability that X_{ij} = 1, is 2/(j - i + 1). Using the substitution k =
j - i + 1 then yields

\begin{align*}
E[X] &= \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{2}{j - i + 1} \\
&= \sum_{i=1}^{n-1} \sum_{k=2}^{n-i+1} \frac{2}{k} \\
&= \sum_{k=2}^{n} \sum_{i=1}^{n+1-k} \frac{2}{k} \\
&= \sum_{k=2}^{n} (n + 1 - k) \frac{2}{k} \\
&= \left((n + 1) \sum_{k=2}^{n} \frac{2}{k}\right) - 2(n - 1) \\
&= (2n + 2) \sum_{k=1}^{n} \frac{1}{k} - 4n.
\end{align*}

Notice that we used a rearrangement of the double summation to obtain a clean form
for the expectation.

Recalling that the summation H(n) = \sum_{k=1}^{n} 1/k satisfies H(n) = ln n + Θ(1), we
have E[X] = 2n ln n + Θ(n). ■
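
The rearrangement of the double summation can be checked exactly for small n. This sketch evaluates the sum of 2/(j - i + 1) over all pairs i < j and compares it with the closed form (2n + 2)H(n) - 4n; the function names are introduced here for illustration.

```python
from fractions import Fraction

def expected_comparisons_double_sum(n):
    """Sum of Pr(X_ij = 1) = 2 / (j - i + 1) over all pairs 1 <= i < j <= n."""
    return sum(
        Fraction(2, j - i + 1)
        for i in range(1, n)
        for j in range(i + 1, n + 1)
    )

def expected_comparisons_closed_form(n):
    """(2n + 2) * H(n) - 4n, where H(n) is the nth harmonic number."""
    harmonic = sum(Fraction(1, k) for k in range(1, n + 1))
    return (2 * n + 2) * harmonic - 4 * n

for n in (2, 3, 10, 100):
    exact = expected_comparisons_closed_form(n)
    print(n, float(exact), exact == expected_comparisons_double_sum(n))
```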
Next we consider the deterministic version of Quicksort, on random input. We assume
that the order of the elements in each recursively constructed sublist is the same as in
the initial list.

Theorem 2.12: Suppose that, whenever a pivot is chosen for Quicksort, the first
element of the sublist is chosen. If the input is chosen uniformly at random from all possible
permutations of the values, then the expected number of comparisons made by
Deterministic Quicksort is 2n ln n + O(n).


Proof: The proof is essentially the same as for Random Quicksort. Again, y_i and y_j
are compared if and only if either y_i or y_j is the first pivot selected by Quicksort from
the set Y^{ij}. Since the order of elements in each sublist is the same as in the original
list, the first pivot selected from the set Y^{ij} is just the first element from Y^{ij} in the
input list, and since all possible permutations of the input values are equally likely, every
element in Y^{ij} is equally likely to be first. From this, we can again use linearity of
expectations in the same way as in the analysis of Random Quicksort to obtain the same
expression for E[X]. ■


2.6. Exercises 

Exercise 2.1: Suppose we roll a fair k-sided die with the numbers 1 through k on the
die's faces. If X is the number that appears, what is E[X]?

Exercise 2.2: A monkey types on a 26-letter keyboard that has lowercase letters only.
Each letter is chosen independently and uniformly at random from the alphabet. If the
monkey types 1,000,000 letters, what is the expected number of times the sequence
"proof" appears?

Exercise 2.3: Give examples of functions f and random variables X where E[f(X)] <
f(E[X]), E[f(X)] = f(E[X]), and E[f(X)] > f(E[X]).

Exercise 2.4: Prove that E[X^k] ≥ E[X]^k for any even integer k ≥ 1.


Exercise 2.5: If X is a B(n, 1/2) random variable with n ≥ 1, show that the
probability that X is even is 1/2.

Exercise 2.6: Suppose that we independently roll two standard six-sided dice. Let X_1
be the number that shows on the first die, X_2 the number on the second die, and X the
sum of the numbers on the two dice.


(a) What is E[X | X_1 is even]?

(b) What is E[X | X_1 = X_2]?

(c) What is E[X_1 | X = 9]?

(d) What is E[X_1 - X_2 | X = k] for k in the range [2,12]?
Exercise 2.7: Let X and Y be independent geometric random variables, where X has
parameter p and Y has parameter q.

(a) What is the probability that X = Y?

(b) What is E[max(X, Y)]?

(c) What is Pr(min(X, Y) = k)?

(d) What is E[X | X ≤ Y]?

You may find it helpful to keep in mind the memoryless property of geometric random
variables.

Exercise 2.8: (a) Alice and Bob decide to have children until either they have their
first girl or they have k ≥ 1 children. Assume that each child is a boy or girl
independently with probability 1/2 and that there are no multiple births. What is the expected
number of female children that they have? What is the expected number of male
children that they have?

(b) Suppose Alice and Bob simply decide to keep having children until they have
their first girl. Assuming that this is possible, what is the expected number of boys that
they have?

Exercise 2.9: (a) Suppose that we roll twice a fair k-sided die with the numbers 1
through k on the die's faces, obtaining values X_1 and X_2. What is E[max(X_1, X_2)]?
What is E[min(X_1, X_2)]?

(b) Show from your calculations in part (a) that

E[max(X_1, X_2)] + E[min(X_1, X_2)] = E[X_1] + E[X_2]. (2.1)

(c) Explain why Eqn. (2.1) must be true by using the linearity of expectations instead
of a direct computation.

Exercise 2.10: (a) Show by induction that if f: R → R is convex then, for any
x_1, x_2, ..., x_n and λ_1, λ_2, ..., λ_n with \sum_{i=1}^{n} λ_i = 1,

\[ f\left(\sum_{i=1}^{n} \lambda_i x_i\right) \le \sum_{i=1}^{n} \lambda_i f(x_i). \qquad (2.2) \]

(b) Use Eqn. (2.2) to prove that if f: R → R is convex then

\[ E[f(X)] \ge f(E[X]) \]

for any random variable X that takes on only finitely many values.

Exercise 2.11: Prove Lemma 2.6.

Exercise 2.12: We draw cards uniformly at random with replacement from a deck of
n cards. What is the expected number of cards we must draw until we have seen all n
cards in the deck? If we draw 2n cards, what is the expected number of cards in the
deck that are not chosen at all? Chosen exactly once?
Exercise 2.13: (a) Consider the following variation of the coupon collector's problem.
Each box of cereal contains one of 2n different coupons. The coupons are organized
into n pairs, so that coupons 1 and 2 are a pair, coupons 3 and 4 are a pair, and so on.
Once you obtain one coupon from every pair, you can obtain a prize. Assuming that
the coupon in each box is chosen independently and uniformly at random from the 2n
possibilities, what is the expected number of boxes you must buy before you can claim
the prize?

(b) Generalize the result of the problem in part (a) for the case where there are kn
different coupons, organized into n disjoint sets of k coupons, so that you need one
coupon from every set.


Exercise 2.14: The geometric distribution arises as the distribution of the number of 
times we flip a coin until it comes up heads. Consider now the distribution of the 
number of flips X until the kth head appears, where each coin flip comes up heads 
independently with probability p. Prove that this distribution is given by 

\[ \Pr(X = n) = \binom{n-1}{k-1} p^k (1-p)^{n-k} \]

for $n \ge k$. (This is known as the negative binomial distribution.) 

Exercise 2.15: For a coin that comes up heads independently with probability p on 
each flip, what is the expected number of flips until the kth heads? 

Exercise 2.16: Suppose we flip a coin n times to obtain a sequence of flips $X_1, X_2, \ldots, X_n$. 
A streak of flips is a consecutive subsequence of flips that are all the same. For example, 
if $X_3$, $X_4$, and $X_5$ are all heads, there is a streak of length 3 starting at the third 
flip. (If $X_6$ is also heads, then there is also a streak of length 4 starting at the third flip.) 

(a) Let n be a power of 2. Show that the expected number of streaks of length $\log_2 n + 1$ 
is $1 - o(1)$. 

(b) Show that, for sufficiently large n, the probability that there is no streak of length 
at least $\lfloor \log_2 n - 2 \log_2 \log_2 n \rfloor$ is less than 1/n. (Hint: Break the sequence of flips 
up into disjoint blocks of $\lfloor \log_2 n - 2 \log_2 \log_2 n \rfloor$ consecutive flips, and use that 
the event that one block is a streak is independent of the event that any other block 
is a streak.) 

Exercise 2.17: Recall the recursive spawning process described in Section 2.3. Suppose 
that each call to process S recursively spawns new copies of the process S, where 
the number of new copies is 2 with probability p and 0 with probability 1 - p. If $Y_i$ 
denotes the number of copies of S in the ith generation, determine $E[Y_i]$. For what 
values of p is the expected total number of copies bounded? 


Exercise 2.18: The following approach is often called reservoir sampling. Suppose 
we have a sequence of items passing by one at a time. We want to maintain a sample 
of one item with the property that it is uniformly distributed over all the items that we 
have seen at each step. Moreover, we want to accomplish this without knowing the 
total number of items in advance or storing all of the items that we see. 

Consider the following algorithm, which stores just one item in memory at all times. 
When the first item appears, it is stored in the memory. When the kth item appears, it 
replaces the item in memory with probability 1/k. Explain why this algorithm solves 
the problem. 
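
The one-item procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `reservoir_sample` and `empirical_frequencies` are hypothetical, the stream is taken to be any Python iterable, and the frequency check is an empirical sanity test, not the explanation the exercise asks for.

```python
import random
from collections import Counter

def reservoir_sample(stream, rng):
    """Return one item from an iterable of unknown length, storing one item at a time."""
    kept = None
    for k, item in enumerate(stream, start=1):
        # The kth item replaces the stored item with probability 1/k.
        # For k = 1 the test always succeeds, so the first item is stored.
        if rng.random() < 1.0 / k:
            kept = item
    return kept

def empirical_frequencies(n_items, trials, seed=0):
    """Fraction of trials in which each of the items 0..n_items-1 ends up stored."""
    rng = random.Random(seed)
    counts = Counter(reservoir_sample(range(n_items), rng) for _ in range(trials))
    return {item: counts[item] / trials for item in range(n_items)}

if __name__ == "__main__":
    # With 5 items, each frequency should be close to 1/5.
    print(empirical_frequencies(5, 20000))
```

Changing the replacement probability in the marked line (for instance to a constant) gives a quick way to experiment with variants of the scheme.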

Exercise 2.19: Suppose that we modify the reservoir sampling algorithm of Exercise 2.18 
so that, when the kth item appears, it replaces the item in memory with 
probability 1/2. Describe the distribution of the item in memory. 

Exercise 2.20: A permutation on the numbers [1, n] can be represented as a function 
$\pi\colon [1, n] \to [1, n]$, where $\pi(i)$ is the position of i in the ordering given by the permutation. 
A fixed point of a permutation $\pi\colon [1, n] \to [1, n]$ is a value x for which $\pi(x) = x$. 
Find the expected number of fixed points of a permutation chosen uniformly at random 
from all permutations. 


Exercise 2.21: Let $a_1, a_2, \ldots, a_n$ be a random permutation of $\{1, 2, \ldots, n\}$, equally 
likely to be any of the n! possible permutations. When sorting the list $a_1, a_2, \ldots, a_n$, 
the element $a_i$ must move a distance of $|a_i - i|$ places from its current position to reach 
its position in the sorted order. Find 

\[ E\Bigl[\sum_{i=1}^{n} |a_i - i|\Bigr], \]

the expected total distance that elements will have to be moved. 

Exercise 2.22: Let $a_1, a_2, \ldots, a_n$ be a list of n distinct numbers. We say that $a_i$ and 
$a_j$ are inverted if i < j but $a_i > a_j$. The Bubblesort sorting algorithm swaps pairwise 
adjacent inverted numbers in the list until there are no more inversions, so the list is 
in sorted order. Suppose that the input to Bubblesort is a random permutation, equally 
likely to be any of the n! permutations of n distinct numbers. Determine the expected 
number of inversions that need to be corrected by Bubblesort. 


Exercise 2.23: Linear insertion sort can sort an array of numbers in place. The first 
and second numbers are compared; if they are out of order, they are swapped so that 
they are in sorted order. The third number is then placed in the appropriate place in the 
sorted order. It is first compared with the second; if it is not in the proper order, it is 
swapped and compared with the first. Iteratively, the kth number is handled by swapping 
it downward until the first k numbers are in sorted order. Determine the expected 
number of swaps that need to be made with a linear insertion sort when the input is a 
random permutation of n distinct numbers. 


Exercise 2.24: We roll a standard fair die over and over. What is the expected number 
of rolls until the first pair of consecutive sixes appears? (Hint: The answer is not 36.) 
Exercise 2.25: A blood test is being performed on n individuals. Each person can be 
tested separately, but this is expensive. Pooling can decrease the cost. The blood samples 
of k people can be pooled and analyzed together. If the test is negative, this one 
test suffices for the group of k individuals. If the test is positive, then each of the k persons 
must be tested separately and thus k + 1 total tests are required for the k people. 

Suppose that we create n/k disjoint groups of k people (where k divides n) and use 
the pooling method. Assume that each person has a positive result on the test independently 
with probability p. 

(a) What is the probability that the test for a pooled sample of k people will be positive? 

(b) What is the expected number of tests necessary? 

(c) Describe how to find the best value of k. 

(d) Give an inequality that shows for what values of p pooling is better than just testing 
every individual. 

Exercise 2.26: A permutation $\pi\colon [1, n] \to [1, n]$ can be represented as a set of cycles 
as follows. Let there be one vertex for each number i, i = 1, ..., n. If the permutation 
maps the number i to the number $\pi(i)$, then a directed arc is drawn from vertex i 
to vertex $\pi(i)$. This leads to a graph that is a set of disjoint cycles. Notice that some 
of the cycles could be self-loops. What is the expected number of cycles in a random 
permutation of n numbers? 

Exercise 2.27: Consider the following distribution on the integers $x \ge 1$: 
$\Pr(X = x) = (6/\pi^2) x^{-2}$. This is a valid distribution, since $\sum_{k=1}^{\infty} k^{-2} = \pi^2/6$. 
What is its expectation? 

Exercise 2.28: Consider a simplified version of roulette in which you wager x dollars 
on either red or black. The wheel is spun, and you receive your original wager plus 
another x dollars if the ball lands on your color; if the ball doesn’t land on your color, 
you lose your wager. Each color occurs independently with probability 1/2. (This is a 
simplification because real roulette wheels have one or two spaces that are neither red 
nor black, so the probability of guessing the correct color is actually less than 1/2.) 

The following gambling strategy is a popular one. On the first spin, bet 1 dollar. If 
you lose, bet 2 dollars on the next spin. In general, if you have lost on the first k - 1 
spins, bet $2^{k-1}$ dollars on the kth spin. Argue that by following this strategy you will 
eventually win a dollar. Now let X be the random variable that measures your maximum 
loss before winning (i.e., the amount of money you have lost before the play on 
which you win). Show that E[X] is unbounded. What does it imply about the practicality 
of this strategy? 


Exercise 2.29: Prove that, if $X_0, X_1, \ldots$ is a sequence of random variables such that 

\[ \sum_{j=0}^{\infty} E[|X_j|] \]

converges, then the linearity of expectations holds: 

\[ E\Bigl[\sum_{j=0}^{\infty} X_j\Bigr] = \sum_{j=0}^{\infty} E[X_j]. \]


Exercise 2.30: In the roulette problem of Exercise 2.28, we found that with probability 1 
you eventually win a dollar. Let $X_j$ be the amount you win on the jth bet. (This 
might be 0 if you have already won a previous bet.) Determine $E[X_j]$ and show that, 
by applying the linearity of expectations, you find your expected winnings are 0. Does 
the linearity of expectations hold in this case? (Compare with Exercise 2.29.) 


Exercise 2.31: A variation on the roulette problem of Exercise 2.28 is the following. 
We repeatedly flip a fair coin. You pay j dollars to play the game. If the first head 
comes up on the kth flip, you win $2^k/k$ dollars. What are your expected winnings? 
How much would you be willing to pay to play the game? 


Exercise 2.32: You need a new staff assistant, and you have n people to interview. You 
want to hire the best candidate for the position. When you interview a candidate, you 
can give them a score, with the highest score being the best and no ties being possible. 
You interview the candidates one by one. Because of your company's hiring practices, 
after you interview the kth candidate, you either offer the candidate the job before the 
next interview or you forever lose the chance to hire that candidate. We suppose the 
candidates are interviewed in a random order, chosen uniformly at random from all n! 
possible orderings. 

We consider the following strategy. First, interview m candidates but reject them all: 
these candidates give you an idea of how strong the field is. After the mth candidate, 
hire the first candidate you interview who is better than all of the previous candidates 
you have interviewed. 


(a) Let E be the event that we hire the best assistant, and let $E_i$ be the event that the ith 
candidate is the best and we hire him. Determine $\Pr(E_i)$, and show that 

\[ \Pr(E) = \frac{m}{n} \sum_{j=m+1}^{n} \frac{1}{j-1}. \]

(b) Bound $\sum_{j=m+1}^{n} \frac{1}{j-1}$ to obtain 

\[ \frac{m}{n}(\ln n - \ln m) \le \Pr(E) \le \frac{m}{n}(\ln(n-1) - \ln(m-1)). \]

(c) Show that $m(\ln n - \ln m)/n$ is maximized when m = n/e, and explain why this 
means $\Pr(E) \ge 1/e$ for this choice of m. 
CHAPTER THREE 


Moments and Deviations 


In this and the next chapter we examine techniques for bounding the tail distribution, 
the probability that a random variable assumes values that are far from its expectation. 
In the context of analysis of algorithms, these bounds are the major tool for estimating 
the failure probability of algorithms and for establishing high probability bounds 
on their run-time. In this chapter we study Markov’s and Chebyshev’s inequalities and 
demonstrate their application in an analysis of a randomized median algorithm. The 
next chapter is devoted to the Chernoff bound and its applications. 


3.1. Markov’s Inequality 

Markov's inequality, formulated in the next theorem, is often too weak to yield useful 
results, but it is a fundamental tool in developing more sophisticated bounds. 


Theorem 3.1 [Markov’s Inequality]: Let X be a random variable that assumes only 
nonnegative values. Then, for all a > 0, 

\[ \Pr(X \ge a) \le \frac{E[X]}{a}. \]

Proof: For a > 0, let 

\[ I = \begin{cases} 1 & \text{if } X \ge a, \\ 0 & \text{otherwise,} \end{cases} \]

and note that, since $X \ge 0$, 

\[ I \le \frac{X}{a}. \tag{3.1} \]

Because I is a 0-1 random variable, $E[I] = \Pr(I = 1) = \Pr(X \ge a)$. 
Taking expectations in (3.1) thus yields 

\[ \Pr(X \ge a) = E[I] \le E\Bigl[\frac{X}{a}\Bigr] = \frac{E[X]}{a}. \]

■

For example, suppose we use Markov’s inequality to bound the probability of obtaining 
more than 3n/4 heads in a sequence of n fair coin flips. Let 

\[ X_i = \begin{cases} 1 & \text{if the $i$th coin flip is heads,} \\ 0 & \text{otherwise,} \end{cases} \]

and let $X = \sum_{i=1}^{n} X_i$ denote the number of heads in the n coin flips. Since $E[X_i] = 
\Pr(X_i = 1) = 1/2$, it follows that $E[X] = \sum_{i=1}^{n} E[X_i] = n/2$. Applying Markov’s 
inequality, we obtain 

\[ \Pr(X \ge 3n/4) \le \frac{E[X]}{3n/4} = \frac{n/2}{3n/4} = \frac{2}{3}. \]

3.2. Variance and Moments of a Random Variable 

Markov’s inequality gives the best tail bound possible when all we know is the expectation 
of the random variable and that the variable is nonnegative (see Exercise 3.16). 
It can be improved upon if more information about the distribution of the random variable 
is available. 

Additional information about a random variable is often expressed in terms of its 
moments. The expectation is also called the first moment of a random variable. More 
generally, we define the moments of a random variable as follows. 

Definition 3.1: The kth moment of a random variable X is $E[X^k]$. 

A significantly stronger tail bound is obtained when the second moment ($E[X^2]$) is also 
available. Given the first and second moments, one can compute the variance and standard 
deviation of the random variable. Intuitively, the variance and standard deviation 
offer a measure of how far the random variable is likely to be from its expectation. 

Definition 3.2: The variance of a random variable X is defined as 

\[ \mathrm{Var}[X] = E[(X - E[X])^2] = E[X^2] - (E[X])^2. \]

The standard deviation of a random variable X is 

\[ \sigma[X] = \sqrt{\mathrm{Var}[X]}. \]

The two forms of the variance in the definition are equivalent, as is easily seen by using 
the linearity of expectations. Keeping in mind that E[X] is a constant, we have 

\begin{align*}
E[(X - E[X])^2] &= E[X^2 - 2XE[X] + E[X]^2] \\
&= E[X^2] - 2E[XE[X]] + E[X]^2 \\
&= E[X^2] - 2E[X]E[X] + E[X]^2 \\
&= E[X^2] - (E[X])^2.
\end{align*}
If a random variable X is constant - so that it always assumes the same value - 
then its variance and standard deviation are zero. More generally, if a random variable 
X takes on the value kE[X] with probability 1/k and the value 0 with probability 
1 - 1/k, then its variance is $(k - 1)(E[X])^2$ and its standard deviation is $\sqrt{k - 1}\,E[X]$. 
These cases help demonstrate the intuition that the variance (and standard deviation) 
of a random variable are small when the random variable assumes values close to its 
expectation and are large when it assumes values far from its expectation. 

We have previously seen that the expectation of the sum of two random variables is 
equal to the sum of their individual expectations. It is natural to ask whether the same 
is true for the variance. We find that the variance of the sum of two random variables 
has an extra term, called the covariance. 

Definition 3.3: The covariance of two random variables X and Y is 

\[ \mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]. \]

Theorem 3.2: For any two random variables X and Y, 

\[ \mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}(X, Y). \]

Proof: 

\begin{align*}
\mathrm{Var}[X + Y] &= E[(X + Y - E[X + Y])^2] \\
&= E[(X + Y - E[X] - E[Y])^2] \\
&= E[(X - E[X])^2 + (Y - E[Y])^2 + 2(X - E[X])(Y - E[Y])] \\
&= E[(X - E[X])^2] + E[(Y - E[Y])^2] + 2E[(X - E[X])(Y - E[Y])] \\
&= \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}(X, Y).
\end{align*}

■


The extension of this theorem to a sum of any finite number of random variables is 
proven in Exercise 3.14. 

The variance of the sum of two (or any finite number of) random variables does 
equal the sum of the variances when the random variables are independent. Equivalently, 
if X and Y are independent random variables, then their covariance is equal to 
zero. To prove this result, we first need a result about the expectation of the product of 
independent random variables. 


Theorem 3.3: If X and Y are two independent random variables, then 

\[ E[X \cdot Y] = E[X] \cdot E[Y]. \]

Proof: In the summations that follow, let i take on all values in the range of X, and let 
j take on all values in the range of Y: 

\begin{align*}
E[X \cdot Y] &= \sum_i \sum_j (i \cdot j) \Pr((X = i) \cap (Y = j)) \\
&= \sum_i \sum_j (i \cdot j) \Pr(X = i) \cdot \Pr(Y = j) \\
&= \Bigl(\sum_i i \Pr(X = i)\Bigr) \Bigl(\sum_j j \Pr(Y = j)\Bigr) \\
&= E[X] \cdot E[Y],
\end{align*}

where the independence of X and Y is used in the second line. 

■


Unlike the linearity of expectations, which holds for the sum of random variables 
whether they are independent or not, the result that the expectation of the product of 
two (or more) random variables is equal to the product of their expectations does not 
necessarily hold if the random variables are dependent. To see this, let Y and Z each 
correspond to fair coin flips, with Y and Z taking on the value 0 if the flip is heads and 
1 if the flip is tails. Then E[Y] = E[Z] = 1/2. If the two flips are independent, then 
$Y \cdot Z$ is 1 with probability 1/4 and 0 otherwise, so indeed $E[Y \cdot Z] = E[Y] \cdot E[Z]$. 
Suppose instead that the coin flips are dependent in the following way: the coins are 
tied together, so Y and Z either both come up heads or both come up tails together. 
Each coin considered individually is still a fair coin flip, but now $Y \cdot Z$ is 1 with probability 
1/2 and so $E[Y \cdot Z] \ne E[Y] \cdot E[Z]$. 
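
The two coin experiments can be checked by enumerating the joint distributions directly. The sketch below is illustrative only: the dictionary encoding of a joint distribution and the helper name `expectation` are assumptions made for this example, with 0 standing for heads and 1 for tails as in the text.

```python
from fractions import Fraction

def expectation(joint, f):
    """E[f(Y, Z)] for a joint distribution given as {(y, z): probability}."""
    return sum(p * f(y, z) for (y, z), p in joint.items())

quarter, half = Fraction(1, 4), Fraction(1, 2)
# Two independent fair flips: all four outcomes are equally likely.
independent = {(y, z): quarter for y in (0, 1) for z in (0, 1)}
# Coins tied together: both heads (0, 0) or both tails (1, 1).
tied = {(0, 0): half, (1, 1): half}

for name, joint in (("independent", independent), ("tied", tied)):
    e_y = expectation(joint, lambda y, z: y)
    e_z = expectation(joint, lambda y, z: z)
    e_yz = expectation(joint, lambda y, z: y * z)
    # The covariance is E[YZ] - E[Y]E[Z]; it vanishes only in the first case.
    print(name, "E[YZ] =", e_yz, "E[Y]E[Z] =", e_y * e_z, "Cov =", e_yz - e_y * e_z)
```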


Corollary 3.4: If X and Y are independent random variables, then 

\[ \mathrm{Cov}(X, Y) = 0 \]

and 

\[ \mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]. \]

Proof: 

\begin{align*}
\mathrm{Cov}(X, Y) &= E[(X - E[X])(Y - E[Y])] \\
&= E[X - E[X]] \cdot E[Y - E[Y]] \\
&= 0.
\end{align*}

In the second equation we have used the fact that, since X and Y are independent, so 
are X - E[X] and Y - E[Y] and hence Theorem 3.3 applies. For the last equation we 
use the fact that, for any random variable Z, 

\[ E[(Z - E[Z])] = E[Z] - E[E[Z]] = 0. \]

Since Cov(X, Y) = 0, we have Var[X + Y] = Var[X] + Var[Y]. ■


By induction we can extend the result of Corollary 3.4 to show that the variance of 
the sum of any finite number of independent random variables equals the sum of their 
variances. 
Theorem 3.5: Let $X_1, X_2, \ldots, X_n$ be mutually independent random variables. Then 

\[ \mathrm{Var}\Bigl[\sum_{i=1}^{n} X_i\Bigr] = \sum_{i=1}^{n} \mathrm{Var}[X_i]. \]


3.2.1. Example: Variance of a Binomial Random Variable 

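The variance of a binomial random variable X with parameters n and p can be determined 
directly by computing $E[X^2]$: 

\begin{align*}
E[X^2] &= \sum_{j=0}^{n} \binom{n}{j} p^j (1-p)^{n-j} j^2 \\
&= \sum_{j=0}^{n} \frac{n!}{(n-j)!\,j!} p^j (1-p)^{n-j} \bigl((j^2 - j) + j\bigr) \\
&= \sum_{j=0}^{n} \frac{n!\,(j^2 - j)}{(n-j)!\,j!} p^j (1-p)^{n-j} + \sum_{j=0}^{n} \frac{n!\,j}{(n-j)!\,j!} p^j (1-p)^{n-j} \\
&= n(n-1)p^2 \sum_{j=2}^{n} \frac{(n-2)!}{(n-j)!\,(j-2)!} p^{j-2} (1-p)^{n-j} + np \sum_{j=1}^{n} \frac{(n-1)!}{(n-j)!\,(j-1)!} p^{j-1} (1-p)^{n-j} \\
&= n(n-1)p^2 + np.
\end{align*}

Here we have simplified the summations by using the binomial theorem. We conclude 
that 

\begin{align*}
\mathrm{Var}[X] &= E[X^2] - (E[X])^2 \\
&= n(n-1)p^2 + np - n^2p^2 \\
&= np - np^2 \\
&= np(1-p).
\end{align*}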

An alternative derivation makes use of independence. Recall from Section 2.2 that a 
binomial random variable X can be represented as the sum of n independent Bernoulli 
trials, each with success probability p. Such a Bernoulli trial Y has variance 

\[ E[(Y - E[Y])^2] = p(1-p)^2 + (1-p)(-p)^2 = p - p^2 = p(1-p). \]

By Corollary 3.4, the variance of X is then np(1 - p). 
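
Both derivations can be checked for small parameters by summing over the binomial distribution with exact rational arithmetic. The following is a small sketch under assumed example values (n = 12 and p = 1/3 are arbitrary); the function name `binomial_moments` is hypothetical.

```python
from fractions import Fraction
from math import comb

def binomial_moments(n, p):
    """Return (E[X], E[X^2]) for X ~ Binomial(n, p), summing over the pmf exactly."""
    p = Fraction(p)
    pmf = [comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n + 1)]
    first = sum(j * q for j, q in enumerate(pmf))
    second = sum(j * j * q for j, q in enumerate(pmf))
    return first, second

n, p = 12, Fraction(1, 3)   # assumed example parameters
first, second = binomial_moments(n, p)
variance = second - first**2
# Compare with the closed forms n(n-1)p^2 + np and np(1-p).
print("E[X^2] =", second, "closed form:", n * (n - 1) * p**2 + n * p)
print("Var[X] =", variance, "closed form:", n * p * (1 - p))
```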


3.3. Chebyshev’s Inequality 

Using the expectation and the variance of the random variable, one can derive a significantly 
stronger tail bound known as Chebyshev’s inequality. 

Theorem 3.6 [Chebyshev’s Inequality]: For any a > 0, 

\[ \Pr(|X - E[X]| \ge a) \le \frac{\mathrm{Var}[X]}{a^2}. \]

Proof: We first observe that 

\[ \Pr(|X - E[X]| \ge a) = \Pr((X - E[X])^2 \ge a^2). \]

Since $(X - E[X])^2$ is a nonnegative random variable, we can apply Markov’s inequality 
to prove: 

\[ \Pr((X - E[X])^2 \ge a^2) \le \frac{E[(X - E[X])^2]}{a^2} = \frac{\mathrm{Var}[X]}{a^2}. \]

■


The following useful variants of Chebyshev’s inequality bound the deviation of the random 
variable from its expectation in terms of a constant factor of its standard deviation 
or expectation. 

Corollary 3.7: For any t > 1, 

\[ \Pr(|X - E[X]| \ge t \cdot \sigma[X]) \le \frac{1}{t^2} \]

and 

\[ \Pr(|X - E[X]| \ge t \cdot E[X]) \le \frac{\mathrm{Var}[X]}{t^2 (E[X])^2}. \]


Let us again consider our coin-flipping example, and this time use Chebyshev’s inequality 
to bound the probability of obtaining more than 3n/4 heads in a sequence of 
n fair coin flips. Recall that $X_i = 1$ if the ith coin flip is heads and 0 otherwise, and 
that $X = \sum_{i=1}^{n} X_i$ denotes the number of heads in the n coin flips. To use Chebyshev’s 
inequality we need to compute the variance of X. Observe first that, since $X_i$ is a 0-1 
random variable, 

\[ E[(X_i)^2] = E[X_i] = \frac{1}{2}. \]

Thus, 

\[ \mathrm{Var}[X_i] = E[(X_i)^2] - (E[X_i])^2 = \frac{1}{2} - \frac{1}{4} = \frac{1}{4}. \]

Now, since $X = \sum_{i=1}^{n} X_i$ and the $X_i$ are independent, we can use Theorem 3.5 to 
compute 

\[ \mathrm{Var}[X] = \mathrm{Var}\Bigl[\sum_{i=1}^{n} X_i\Bigr] = \sum_{i=1}^{n} \mathrm{Var}[X_i] = \frac{n}{4}. \]

Applying Chebyshev’s inequality then yields 

\begin{align*}
\Pr(X \ge 3n/4) &\le \Pr(|X - E[X]| \ge n/4) \\
&\le \frac{\mathrm{Var}[X]}{(n/4)^2} \\
&= \frac{n/4}{(n/4)^2} \\
&= \frac{4}{n}.
\end{align*}





In fact, we can do slightly better. Chebyshev’s inequality yields that 4/n is actually 
a bound on the probability that X is either smaller than n/4 or larger than 3n/4, so 
by symmetry the probability that X is greater than 3n/4 is actually at most 2/n. Chebyshev’s 
inequality gives a significantly better bound than Markov’s inequality for large n. 
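
The three bounds for the coin-flipping example can be compared with the exact tail probability for a few values of n. This is a sketch for illustration only: the function names are hypothetical, and the chosen values of n are arbitrary multiples of 4.

```python
from fractions import Fraction
from math import comb

def exact_tail(n):
    """Pr(X >= 3n/4) for X ~ Binomial(n, 1/2); n is assumed divisible by 4."""
    return Fraction(sum(comb(n, j) for j in range(3 * n // 4, n + 1)), 2 ** n)

def tail_bounds(n):
    """Bounds on Pr(X >= 3n/4) derived in the text for n fair coin flips."""
    markov = Fraction(2, 3)        # E[X] / (3n/4)
    chebyshev = Fraction(4, n)     # Var[X] / (n/4)^2, a two-sided bound
    symmetric = Fraction(2, n)     # half of the two-sided bound, by symmetry
    return markov, chebyshev, symmetric

for n in (8, 40, 200):
    markov, chebyshev, symmetric = tail_bounds(n)
    print(n, float(markov), float(chebyshev), float(symmetric), float(exact_tail(n)))
```

Even the improved bound 2/n is far from the exact value for large n, which is part of the motivation for the sharper bounds of the next chapter.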

3.3.1. Example: Coupon Collector’s Problem 

We apply Markov’s and Chebyshev’s inequalities to the coupon collector’s problem. 
Recall that the time X to collect n coupons has expectation $nH_n$, where $H_n = 
\sum_{i=1}^{n} 1/i = \ln n + O(1)$. Hence Markov’s inequality yields 

\[ \Pr(X \ge 2nH_n) \le \frac{1}{2}. \]

To use Chebyshev’s inequality, we need to find the variance of X. Recall again from 
Section 2.4.1 that $X = \sum_{i=1}^{n} X_i$, where the $X_i$ are geometric random variables with 
parameter (n - i + 1)/n. In this case, the $X_i$ are independent because the time to collect 
the ith coupon does not depend on how long it took to collect the previous i - 1 
coupons. Hence 

\[ \mathrm{Var}[X] = \mathrm{Var}\Bigl[\sum_{i=1}^{n} X_i\Bigr] = \sum_{i=1}^{n} \mathrm{Var}[X_i], \]

so we need to find the variance of a geometric random variable. 

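Let Y be a geometric random variable with parameter p. As we saw in Section 2.4, 
E[Y] = 1/p. We calculate $E[Y^2]$. The following trick proves useful. We know that, 
for 0 < x < 1, 

\[ \sum_{i=0}^{\infty} x^i = \frac{1}{1-x}. \]

Taking derivatives, we find: 

\begin{align*}
\sum_{i=0}^{\infty} i x^{i-1} &= \sum_{i=0}^{\infty} (i+1) x^i = \frac{1}{(1-x)^2}, \\
\sum_{i=0}^{\infty} i(i-1) x^{i-2} &= \sum_{i=0}^{\infty} (i+1)(i+2) x^i = \frac{2}{(1-x)^3}.
\end{align*}

We can conclude that 

\begin{align*}
\sum_{i=1}^{\infty} i^2 x^i &= \sum_{i=0}^{\infty} i^2 x^i \\
&= \sum_{i=0}^{\infty} (i+1)(i+2) x^i - 3 \sum_{i=0}^{\infty} (i+1) x^i + \sum_{i=0}^{\infty} x^i \\
&= \frac{2}{(1-x)^3} - \frac{3}{(1-x)^2} + \frac{1}{1-x} \\
&= \frac{x^2 + x}{(1-x)^3}.
\end{align*}

We now use this to find 

\begin{align*}
E[Y^2] &= \sum_{i=1}^{\infty} p(1-p)^{i-1} i^2 \\
&= \frac{p}{1-p} \sum_{i=1}^{\infty} (1-p)^i i^2 \\
&= \frac{p}{1-p} \cdot \frac{(1-p)^2 + (1-p)}{p^3} \\
&= \frac{2-p}{p^2}.
\end{align*}

Finally, we reach 

\begin{align*}
\mathrm{Var}[Y] &= E[Y^2] - E[Y]^2 \\
&= \frac{2-p}{p^2} - \frac{1}{p^2} \\
&= \frac{1-p}{p^2}.
\end{align*}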

We have just proven the following useful lemma. 


Lemma 3.8: The variance of a geometric random variable with parameter p is 
$(1 - p)/p^2$. 
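
The lemma can be sanity-checked numerically by truncating the defining sums. The sketch below is an approximation under stated assumptions: `geometric_moments` is a hypothetical helper, and cutting the series off after 10,000 terms is an arbitrary choice that leaves a negligible remainder for the values of p used here.

```python
def geometric_moments(p, terms=10_000):
    """Approximate E[Y] and E[Y^2] for Pr(Y = i) = p (1 - p)^(i - 1), i >= 1."""
    first = second = 0.0
    prob = p                      # Pr(Y = 1)
    for i in range(1, terms + 1):
        first += i * prob
        second += i * i * prob
        prob *= 1.0 - p           # becomes Pr(Y = i + 1)
    return first, second

for p in (0.5, 0.2, 0.05):
    first, second = geometric_moments(p)
    # Truncated-series variance next to the closed form (1 - p) / p^2.
    print(p, second - first**2, (1 - p) / p**2)
```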

For a geometric random variable Y, $E[Y^2]$ can also be derived using conditional expectations. 
We use that Y corresponds to the number of flips until the first heads, where 
each flip is heads with probability p. Let X = 0 if the first flip is tails and X = 1 if the 
first flip is heads. By Lemma 2.5, 

\begin{align*}
E[Y^2] &= \Pr(X = 0) E[Y^2 \mid X = 0] + \Pr(X = 1) E[Y^2 \mid X = 1] \\
&= (1-p) E[Y^2 \mid X = 0] + p E[Y^2 \mid X = 1].
\end{align*}

If X = 1, then Y = 1 and so $E[Y^2 \mid X = 1] = 1$. If X = 0, then Y > 1. In this case, 
let the number of remaining flips after the first flip until the first head be Z. Then 

\begin{align*}
E[Y^2] &= (1-p) E[(Z+1)^2] + p \cdot 1 \\
&= (1-p) E[Z^2] + 2(1-p) E[Z] + 1 \tag{3.2}
\end{align*}

by the linearity of expectations. By the memoryless property of geometric random variables, 
Z is also a geometric random variable with parameter p. Hence E[Z] = 1/p 
and $E[Z^2] = E[Y^2]$. Plugging these values into Eqn. (3.2), we have 

\begin{align*}
E[Y^2] &= (1-p) E[Y^2] + \frac{2(1-p)}{p} + 1 \\
&= (1-p) E[Y^2] + \frac{2-p}{p},
\end{align*}

which yields $E[Y^2] = (2 - p)/p^2$, matching our other derivation. 

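We return now to the question of the variance in the coupon collector’s problem. 
We simplify the argument by using the upper bound $\mathrm{Var}[Y] \le 1/p^2$ for a geometric 
random variable, instead of the exact result of Lemma 3.8. Then 

\[ \mathrm{Var}[X] = \sum_{i=1}^{n} \mathrm{Var}[X_i] \le \sum_{i=1}^{n} \Bigl(\frac{n}{n-i+1}\Bigr)^2 = n^2 \sum_{i=1}^{n} \Bigl(\frac{1}{i}\Bigr)^2 \le \frac{\pi^2 n^2}{6}. \]

Here we have used the identity 

\[ \sum_{i=1}^{\infty} \Bigl(\frac{1}{i}\Bigr)^2 = \frac{\pi^2}{6}. \]

Now, by Chebyshev’s inequality, 

\[ \Pr(|X - nH_n| \ge nH_n) \le \frac{n^2 \pi^2/6}{(nH_n)^2} = \frac{\pi^2}{6(H_n)^2} = O\Bigl(\frac{1}{\ln^2 n}\Bigr). \]
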

In this case, Chebyshev’s inequality again gives a much better bound than Markov’s 
inequality. But it is still a fairly weak bound, as we can see by considering instead a 
fairly simple union bound argument. 

Consider the probability of not obtaining the ith coupon after $n \ln n + cn$ steps. This 
probability is 

\[ \Bigl(1 - \frac{1}{n}\Bigr)^{n(\ln n + c)} < e^{-(\ln n + c)} = \frac{1}{e^c n}. \]

By a union bound, the probability that some coupon has not been collected after 
$n \ln n + cn$ steps is only $e^{-c}$. In particular, the probability that all coupons are not 
collected after $2n \ln n$ steps is at most 1/n, a bound that is significantly better than what 
can be achieved even with Chebyshev’s inequality. 
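
A small simulation makes the gap between the three bounds concrete. The code below is a rough sketch and not part of the original analysis: the function names, the number of trials, the seed, and the choice n = 50 are all assumptions made for illustration, and the empirical frequencies are only estimates of the true tail probabilities.

```python
import math
import random

def collect_all(n, rng):
    """Number of uniform draws (with replacement) needed to see all n coupons."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

def tail_estimates(n, trials=2000, seed=1):
    """Empirical tails of the collection time next to the bounds from the text."""
    rng = random.Random(seed)
    harmonic = sum(1.0 / i for i in range(1, n + 1))
    samples = [collect_all(n, rng) for _ in range(trials)]
    return {
        "mean": sum(samples) / trials,
        "n*H_n": n * harmonic,
        "Pr(X >= 2nH_n)": sum(x >= 2 * n * harmonic for x in samples) / trials,
        "Pr(X > 2n ln n)": sum(x > 2 * n * math.log(n) for x in samples) / trials,
        "markov": 0.5,                                      # bound on Pr(X >= 2nH_n)
        "chebyshev": math.pi ** 2 / (6 * harmonic ** 2),    # bound on Pr(|X - nH_n| >= nH_n)
        "union": 1.0 / n,                                   # bound on Pr(X > 2n ln n)
    }

if __name__ == "__main__":
    for key, value in tail_estimates(50).items():
        print(f"{key}: {value:.4f}")
```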


3.4. Application: A Randomized Algorithm for 
Computing the Median 


Given a set S of n elements drawn from a totally ordered universe, the median of S is 
an element m of S such that at least $\lfloor n/2 \rfloor$ elements in S are less than or equal to m 
and at least $\lfloor n/2 \rfloor + 1$ elements in S are greater than or equal to m. If the elements in 
S are distinct, then m is the $(\lceil n/2 \rceil)$th element in the sorted order of S. 

The median can be easily found deterministically in $O(n \log n)$ steps by sorting, and 
there is a relatively complex deterministic algorithm that computes the median in O(n) 
time. Here we analyze a randomized linear time algorithm that is significantly simpler 
than the deterministic one and yields a smaller constant factor in the linear running 
time. To simplify the presentation, we assume that n is odd and that the elements in the 
input set S are distinct. The algorithm and analysis can be easily modified to include 
the case of a multi-set S (see Exercise 3.23) and a set with an even number of elements. 

3.4.1. The Algorithm 

The main idea of the algorithm involves sampling, which we first discussed in Section 1.2. 
The goal is to find two elements that are close together in the sorted order of S 
and that have the median lie between them. Specifically, we seek two elements $d, u \in S$ 
such that: 


1. $d \le m \le u$ (the median m is between d and u); and 

2. for $C = \{s \in S : d \le s \le u\}$, $|C| = o(n/\log n)$ (the total number of elements 
between d and u is small). 


Sampling gives us a simple and efficient method for finding two such elements. 

We claim that, once these two elements are identified, the median can easily be 
found in linear time with the following steps. The algorithm counts (in linear time) 
the number $\ell_d$ of elements of S that are smaller than d and then sorts (in sublinear, or 
o(n), time) the set C. Notice that, since $|C| = o(n/\log n)$, the set C can be sorted in 
time o(n) using any standard sorting algorithm that requires $O(m \log m)$ time to sort 
m elements. The $(\lfloor n/2 \rfloor - \ell_d + 1)$th element in the sorted order of C is m, since there 
are exactly $\lfloor n/2 \rfloor$ elements in S that are smaller than that value ($\lfloor n/2 \rfloor - \ell_d$ in the set 
C and $\ell_d$ in S - C). 

To find the elements d and u, we sample with replacement a multi-set R of $\lceil n^{3/4} \rceil$ 
elements from S. Recall that sampling with replacement means each element in R is 
chosen uniformly at random from the set S, independent of previous choices. Thus, the 
same element of S might appear more than once in the multi-set R. Sampling without 
replacement might give marginally better bounds, but both implementing and analyzing 
it are significantly harder. It is worth noting that we assume that an element can be 
sampled from S in constant time. 

Since R is a random sample of S we expect m, the median element of S, to be close 
to the median element of R. We therefore choose d and u to be elements of R surrounding 
the median of R. 

We require all the steps of our algorithm to work with high probability, by which 
we mean with probability at least $1 - O(1/n^c)$ for some constant c > 0. To guarantee 
that with high probability the set C includes the median m, we fix d and u to be respectively 
the $\lfloor n^{3/4}/2 - \sqrt{n} \rfloor$th and the $\lceil n^{3/4}/2 + \sqrt{n} \rceil$th elements in the sorted order of 
R. With this choice, the set C includes all the elements of S that are between the $2\sqrt{n}$ 
sample points surrounding the median of R. The analysis will clarify that the choice of 
the size of R and the choices for d and u are tailored to guarantee both that (a) the set 
C is large enough to include m with high probability and (b) the set C is sufficiently 
small so that it can be sorted in sublinear time with high probability. 

Randomized Median Algorithm: 

Input: A set S of n elements over a totally ordered universe. 

Output: The median element of S, denoted by m. 

1. Pick a (multi-)set R of $\lceil n^{3/4} \rceil$ elements in S, chosen independently and 
uniformly at random with replacement. 

2. Sort the set R. 

3. Let d be the $(\lfloor \frac{1}{2} n^{3/4} - \sqrt{n} \rfloor)$th smallest element in the sorted set R. 

4. Let u be the $(\lceil \frac{1}{2} n^{3/4} + \sqrt{n} \rceil)$th smallest element in the sorted set R. 

5. By comparing every element in S to d and u, compute the set 
$C = \{x \in S : d \le x \le u\}$ and the numbers $\ell_d = |\{x \in S : x < d\}|$ and 
$\ell_u = |\{x \in S : x > u\}|$. 

6. If $\ell_d > n/2$ or $\ell_u > n/2$ then FAIL. 

7. If $|C| \le 4n^{3/4}$ then sort the set C, otherwise FAIL. 

8. Output the $(\lfloor n/2 \rfloor - \ell_d + 1)$th element in the sorted order of C. 


Algorithm 3.1: Randomized median algorithm. 

A formal description of the procedure is presented as Algorithm 3.1. In what follows, 
for convenience we treat $\sqrt{n}$ and $n^{3/4}$ as integers. 
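
The eight steps of Algorithm 3.1 translate fairly directly into code. The following Python sketch is one possible rendering, offered as an illustration and not as the authors' implementation. It assumes the input is a list of distinct, mutually comparable values of odd length; the function names, the use of `None` to signal FAIL, and the clamping of the two ranks to valid positions for very small inputs are choices made here. The retry wrapper `median_with_retries` is likewise a hypothetical helper that reruns the procedure until it does not fail.

```python
import math
import random

def randomized_median(items, rng):
    """One run of the sampling scheme; returns the median, or None on FAIL.

    Assumes len(items) is odd and the elements are distinct.
    """
    n = len(items)
    size = math.ceil(n ** 0.75)
    # Steps 1-2: sample with replacement, then sort the sample.
    sample = sorted(rng.choice(items) for _ in range(size))
    # Steps 3-4: 1-based ranks around the middle of the sample, clamped to
    # valid positions so that small inputs do not index out of range.
    low = max(1, math.floor(0.5 * n ** 0.75 - math.sqrt(n)))
    high = min(size, math.ceil(0.5 * n ** 0.75 + math.sqrt(n)))
    d, u = sample[low - 1], sample[high - 1]
    # Step 5: one pass over the input.
    middle = [x for x in items if d <= x <= u]
    below = sum(1 for x in items if x < d)
    above = sum(1 for x in items if x > u)
    # Steps 6-7: fail rather than return a wrong answer or sort too much.
    if below > n / 2 or above > n / 2 or len(middle) > 4 * n ** 0.75:
        return None
    middle.sort()
    # Step 8: the (floor(n/2) - below + 1)th smallest element, as a 0-based index.
    return middle[n // 2 - below]

def median_with_retries(items, seed=0):
    """Rerun the randomized procedure with fresh samples until it succeeds."""
    rng = random.Random(seed)
    while True:
        result = randomized_median(items, rng)
        if result is not None:
            return result

if __name__ == "__main__":
    data = list(range(1, 2002, 2))      # 1001 distinct values
    random.Random(7).shuffle(data)
    print(median_with_retries(data))
```

Returning `None` on failure keeps a single run faithful to the property that the procedure either outputs the correct median or reports FAIL.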


3.4.2. Analysis of the Algorithm 


Based on our previous discussion, we first prove that - regardless of the random choices 
made throughout the procedure - the algorithm (a) always terminates in linear time and 
(b) outputs either the correct result or FAIL. 


Theorem 3.9: The randomized median algorithm terminates in linear time, and if it 
does not output FAIL then it outputs the correct median element of the input set S. 


Proof: Correctness follows because the algorithm could only give an incorrect answer 
if the median were not in the set C. But then either $\ell_d > n/2$ or $\ell_u > n/2$ and thus 
step 6 of the algorithm guarantees that, in these cases, the algorithm outputs FAIL. 
Similarly, as long as C is sufficiently small, the total work is only linear in the size of 
S. Step 7 of the algorithm therefore guarantees that the algorithm does not take more 
than linear time; if the sorting might take too long, the algorithm outputs FAIL without 
sorting. ■ 

The interesting part of the analysis that remains after Theorem 3.9 is bounding the probability 
that the algorithm outputs FAIL. We bound this probability by identifying three 
“bad” events such that, if none of these bad events occurs, the algorithm does not fail. 
In a series of lemmas, we then bound the probability of each of these events and show 
that the sum of these probabilities is only $O(n^{-1/4})$. 

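Consider the following three events: 

$\mathcal{E}_1$: $Y_1 = |\{r \in R \mid r \le m\}| < \frac{1}{2} n^{3/4} - \sqrt{n}$; 

$\mathcal{E}_2$: $Y_2 = |\{r \in R \mid r \ge m\}| < \frac{1}{2} n^{3/4} - \sqrt{n}$; 

$\mathcal{E}_3$: $|C| > 4n^{3/4}$. 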


Lemma 3.10: The randomized median algorithm fails if and only if at least one of 
$\mathcal{E}_1$, $\mathcal{E}_2$, or $\mathcal{E}_3$ occurs. 

Proof: Failure in step 7 of the algorithm is equivalent to the event $\mathcal{E}_3$. Failure in step 6 
of the algorithm occurs if and only if $\ell_d > n/2$ or $\ell_u > n/2$. But for $\ell_d > n/2$, the 
$(\frac{1}{2} n^{3/4} - \sqrt{n})$th smallest element of R must be larger than m; this is equivalent to the 
event $\mathcal{E}_1$. Similarly, $\ell_u > n/2$ is equivalent to the event $\mathcal{E}_2$. ■ 


Lemma 3.11: 


P 輕 


Proof: Define a random variable X ； by 

1 if the ith sample is less than or equal to the median. 
0 otherwise. 


X, 


The X, are independent, since the sampling is done with replacement. Because there 
are (n — 1)/2 + 1 elements in S that are less than or equal to the median, the proba¬ 
bility that a randomly chosen element of 5 is less than or equal to the median can be 


written as 


P^ = l) = ^i±i = i + ^ 

n 2 2 n 


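The event $\mathcal{E}_1$ is equivalent to

$$Y_1 = \sum_{i=1}^{n^{3/4}} X_i < \frac{1}{2}n^{3/4} - \sqrt{n}.$$

Since $Y_1$ is the sum of Bernoulli trials, it is a binomial random variable with
parameters $n^{3/4}$ and $1/2 + 1/2n$. Hence, using the result of Section 3.2.1 yields

$$\mathrm{Var}[Y_1] = n^{3/4}\left(\frac{1}{2} + \frac{1}{2n}\right)\left(\frac{1}{2} - \frac{1}{2n}\right) = \frac{1}{4}n^{3/4} - \frac{1}{4n^{5/4}} < \frac{1}{4}n^{3/4}.$$

Applying Chebyshev's inequality then yields

$$\Pr(\mathcal{E}_1) = \Pr\left(Y_1 < \frac{1}{2}n^{3/4} - \sqrt{n}\right) \le \Pr\left(|Y_1 - \mathrm{E}[Y_1]| > \sqrt{n}\right) \le \frac{\mathrm{Var}[Y_1]}{n} < \frac{\frac{1}{4}n^{3/4}}{n} = \frac{1}{4}n^{-1/4}.$$ ■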


We similarly obtain the same bound for the probability of the event $\mathcal{E}_2$.
We now bound the probability of the third bad event, $\mathcal{E}_3$.


Lemma 3.12:

$$\Pr(\mathcal{E}_3) \le \frac{1}{2}n^{-1/4}.$$

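Proof: If $\mathcal{E}_3$ occurs, so $|C| > 4n^{3/4}$, then at least one of the following
two events occurs:

$\mathcal{E}_{3,1}$: at least $2n^{3/4}$ elements of $C$ are greater than the median;

$\mathcal{E}_{3,2}$: at least $2n^{3/4}$ elements of $C$ are smaller than the median.

Let us bound the probability that the first event occurs; the second will have the same
bound by symmetry. If there are at least $2n^{3/4}$ elements of $C$ above the median, then
the order of $u$ in the sorted order of $S$ was at least $\frac{1}{2}n + 2n^{3/4}$ and thus
the set $R$ has at least $\frac{1}{2}n^{3/4} - \sqrt{n}$ samples among the
$\frac{1}{2}n - 2n^{3/4}$ largest elements in $S$.

Let $X$ be the number of samples among the $\frac{1}{2}n - 2n^{3/4}$ largest elements in
$S$. Let $X = \sum_{i=1}^{n^{3/4}} X_i$, where

$$X_i = \begin{cases} 1 & \text{if the $i$th sample is among the $\frac{1}{2}n - 2n^{3/4}$ largest elements in $S$,} \\ 0 & \text{otherwise.} \end{cases}$$

Again, $X$ is a binomial random variable, and we find

$$\mathrm{E}[X] = \frac{1}{2}n^{3/4} - 2\sqrt{n}$$

and

$$\mathrm{Var}[X] = n^{3/4}\left(\frac{1}{2} - 2n^{-1/4}\right)\left(\frac{1}{2} + 2n^{-1/4}\right) = \frac{1}{4}n^{3/4} - 4n^{1/4} < \frac{1}{4}n^{3/4}.$$

Applying Chebyshev's inequality yields

$$\Pr(\mathcal{E}_{3,1}) = \Pr\left(X \ge \frac{1}{2}n^{3/4} - \sqrt{n}\right) \qquad (3.3)$$

$$\le \Pr\left(|X - \mathrm{E}[X]| \ge \sqrt{n}\right) \le \frac{\mathrm{Var}[X]}{n} < \frac{1}{4}n^{-1/4}. \qquad (3.4)$$

Similarly,

$$\Pr(\mathcal{E}_{3,2}) \le \frac{1}{4}n^{-1/4}$$

and

$$\Pr(\mathcal{E}_3) \le \Pr(\mathcal{E}_{3,1}) + \Pr(\mathcal{E}_{3,2}) \le \frac{1}{2}n^{-1/4}.$$ ■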

Combining the bounds just derived, we conclude that the probability that the algorithm
outputs FAIL is bounded by

$$\Pr(\mathcal{E}_1) + \Pr(\mathcal{E}_2) + \Pr(\mathcal{E}_3) \le n^{-1/4}.$$

This yields the following theorem.

Theorem 3.13: The probability that the randomized median algorithm fails is bounded
by $n^{-1/4}$.


By repeating Algorithm 3.1 until it succeeds in finding the median, we can obtain an
iterative algorithm that never fails but has a random running time. The samples taken
in successive runs of the algorithm are independent, so the success of each run is
independent of other runs, and hence the number of runs until success is achieved is a
geometric random variable. As an exercise, you may wish to show that this variation of
the algorithm (that runs until it finds a solution) still has linear expected running time.

Randomized algorithms that may fail or return an incorrect answer are called Monte
Carlo algorithms. The running time of a Monte Carlo algorithm often does not depend
on the random choices made. For example, we showed in Theorem 3.9 that the
randomized median algorithm always terminates in linear time, regardless of its random
choices.

A randomized algorithm that always returns the right answer is called a Las Vegas
algorithm. We have seen that the Monte Carlo randomized algorithm for the median
can be turned into a Las Vegas algorithm by running it repeatedly until it succeeds.
Again, turning it into a Las Vegas algorithm means the running time is variable,
although the expected running time is still linear.
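
The Monte Carlo to Las Vegas conversion can be sketched in a few lines of Python. Algorithm 3.1 is not reproduced in this part of the text, so the single-attempt routine below is a reconstruction from the quantities used in the lemmas (a sample $R$ of size $n^{3/4}$, pivots $d$ and $u$ at ranks $\frac{1}{2}n^{3/4} \mp \sqrt{n}$, the counts $\ell_d$ and $\ell_u$, and the set $C$). The function names, the rounding of ranks, and the restriction to distinct elements with $n$ odd are assumptions made for illustration.

```python
import math
import random


def randomized_median_once(s, rng):
    """One Monte Carlo attempt: return the median of s, or None for FAIL.

    Assumes s holds distinct numbers and len(s) is odd, so the median
    is the element of rank (n + 1) // 2.
    """
    n = len(s)
    k = math.ceil(n ** 0.75)                      # sample size, about n^{3/4}
    r = sorted(rng.choice(s) for _ in range(k))   # sampling with replacement
    # 1-based ranks of the pivots d and u within the sorted sample
    # (clamped so that small inputs still work).
    lo = max(1, math.floor(0.5 * n ** 0.75 - math.sqrt(n)))
    hi = min(k, math.ceil(0.5 * n ** 0.75 + math.sqrt(n)))
    d, u = r[lo - 1], r[hi - 1]
    c = [x for x in s if d <= x <= u]
    l_d = sum(1 for x in s if x < d)
    l_u = sum(1 for x in s if x > u)
    if l_d > n / 2 or l_u > n / 2:                # the median is not in C
        return None
    if len(c) > 4 * n ** 0.75:                    # C is too large to sort cheaply
        return None
    c.sort()
    return c[(n + 1) // 2 - l_d - 1]


def randomized_median(s, seed=None):
    """Las Vegas wrapper: repeat independent attempts until one succeeds.

    Returns (median, number_of_runs); the number of runs is geometric.
    """
    rng = random.Random(seed)
    runs = 0
    while True:
        runs += 1
        m = randomized_median_once(s, rng)
        if m is not None:
            return m, runs
```

Whenever an attempt returns a value it is the true median, since both failure checks passed; only the number of runs is random.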


3.5. Exercises 

Exercise 3.1: Let $X$ be a number chosen uniformly at random from $[1, n]$. Find $\mathrm{Var}[X]$.

Exercise 3.2: Let $X$ be a number chosen uniformly at random from $[-k, k]$. Find
$\mathrm{Var}[X]$.

Exercise 3.3: Suppose that we roll a standard fair die 100 times. Let X be the sum 
of the numbers that appear over the 100 rolls. Use Chebyshev’s inequality to bound 
Pr(|X - 350| > 50). 

Exercise 3.4: Prove that, for any real number $c$ and any discrete random variable $X$,
$\mathrm{Var}[cX] = c^2\,\mathrm{Var}[X]$.

Exercise 3.5: Given any two random variables $X$ and $Y$, by the linearity of
expectations we have $\mathrm{E}[X - Y] = \mathrm{E}[X] - \mathrm{E}[Y]$. Prove that, when
$X$ and $Y$ are independent, $\mathrm{Var}[X - Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.

Exercise 3.6: For a coin that comes up heads independently with probability $p$ on each
flip, what is the variance in the number of flips until the $k$th head appears?


Exercise 3.7: A simple model of the stock market suggests that, each day, a stock with
price $q$ will increase by a factor $r > 1$ to $qr$ with probability $p$ and will fall to
$q/r$ with probability $1 - p$. Assuming we start with a stock with price 1, find a
formula for the expected value and the variance of the price of the stock after $d$ days.


Exercise 3.8: Suppose that we have an algorithm that takes as input a string of $n$ bits.
We are told that the expected running time is $O(n^2)$ if the input bits are chosen
independently and uniformly at random. What can Markov's inequality tell us about the
worst-case running time of this algorithm on inputs of size $n$?

Exercise 3.9: (a) Let $X$ be the sum of Bernoulli random variables, $X = \sum_{i=1}^{n} X_i$.
The $X_i$ do not need to be independent. Show that

$$\mathrm{E}[X^2] = \sum_{i=1}^{n} \Pr(X_i = 1)\,\mathrm{E}[X \mid X_i = 1]. \qquad (3.5)$$

Hint: Start by showing that

$$\mathrm{E}[X^2] = \sum_{i=1}^{n} \mathrm{E}[X_i X],$$

and then apply conditional expectations.

(b) Use Eqn. (3.5) to provide another derivation for the variance of a binomial
random variable with parameters $n$ and $p$.


Exercise 3.10: For a geometric random variable $X$, find $\mathrm{E}[X^3]$ and
$\mathrm{E}[X^4]$. (Hint: Use Lemma 2.5.)


Exercise 3.11: Recall the Bubblesort algorithm of Exercise 2.22. Determine the
variance of the number of inversions that need to be corrected by Bubblesort.

Exercise 3.12: Find an example of a random variable with finite expectation and
unbounded variance. Give a clear argument showing that your choice has these properties.

Exercise 3.13: Find an example of a random variable with finite $j$th moments for
$1 \le j \le k$ but an unbounded $(k+1)$th moment. Give a clear argument showing that your
choice has these properties.

Exercise 3.14: Prove that, for any finite collection of random variables
$X_1, X_2, \ldots, X_n$,

$$\mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathrm{Var}[X_i] + 2\sum_{i=1}^{n}\sum_{j>i} \mathrm{Cov}(X_i, X_j).$$


Exercise 3.15: Let the random variable $X$ be representable as a sum of random
variables $X = \sum_{i=1}^{n} X_i$. Show that, if $\mathrm{E}[X_i X_j] = \mathrm{E}[X_i]\,\mathrm{E}[X_j]$
for every pair of $i$ and $j$ with $1 \le i < j \le n$, then
$\mathrm{Var}[X] = \sum_{i=1}^{n} \mathrm{Var}[X_i]$.

Exercise 3.16: This problem shows that Markov's inequality is as tight as it could
possibly be. Given a positive integer $k$, describe a random variable $X$ that assumes only
nonnegative values such that

$$\Pr(X \ge k\,\mathrm{E}[X]) = \frac{1}{k}.$$

Exercise 3.17: Can you give an example (similar to that for Markov's inequality in
Exercise 3.16) that shows that Chebyshev's inequality is tight? If not, explain why not.


Exercise 3.18: Show that, for a random variable $X$ with standard deviation $\sigma[X]$ and
any positive real number $t$:

(a) $$\Pr(X - \mathrm{E}[X] \ge t\,\sigma[X]) \le \frac{1}{1 + t^2};$$

(b) $$\Pr(|X - \mathrm{E}[X]| \ge t\,\sigma[X]) \le \frac{2}{1 + t^2}.$$


Exercise 3.19: Let $Y$ be a nonnegative integer-valued random variable with positive
expectation. Prove

$$\frac{\mathrm{E}[Y]^2}{\mathrm{E}[Y^2]} \le \Pr[Y \ne 0] \le \mathrm{E}[Y].$$


Exercise 3.20: (a) Chebyshev's inequality uses the variance of a random variable to
bound its deviation from its expectation. We can also use higher moments. Suppose
that we have a random variable $X$ and an even integer $k$ for which
$\mathrm{E}[(X - \mathrm{E}[X])^k]$ is finite. Show that

$$\Pr\left(|X - \mathrm{E}[X]| > t\,\sqrt[k]{\mathrm{E}[(X - \mathrm{E}[X])^k]}\right) \le \frac{1}{t^k}.$$

(b) Why is it difficult to derive a similar inequality when $k$ is odd?


Exercise 3.21: A fixed point of a permutation $\pi : [1, n] \to [1, n]$ is a value for which
$\pi(x) = x$. Find the variance in the number of fixed points of a permutation chosen
uniformly at random from all permutations. (Hint: Let $X_i$ be 1 if $\pi(i) = i$, so that
$\sum_{i=1}^{n} X_i$ is the number of fixed points. You cannot use linearity to find
$\mathrm{Var}[\sum_{i=1}^{n} X_i]$, but you can calculate it directly.)

Exercise 3.22: Suppose that we flip a fair coin $n$ times to obtain $n$ random bits.
Consider all $m = \binom{n}{2}$ pairs of these bits in some order. Let $Y_i$ be the
exclusive-or of the $i$th pair of bits, and let $Y = \sum_{i=1}^{m} Y_i$ be the number of
$Y_i$ that equal 1.

(a) Show that each $Y_i$ is 0 with probability 1/2 and 1 with probability 1/2.

(b) Show that the $Y_i$ are not mutually independent.

(c) Show that the $Y_i$ satisfy the property that $\mathrm{E}[Y_i Y_j] = \mathrm{E}[Y_i]\,\mathrm{E}[Y_j]$.

(d) Using Exercise 3.15, find $\mathrm{Var}[Y]$.

(e) Using Chebyshev's inequality, prove a bound on $\Pr(|Y - \mathrm{E}[Y]| \ge n)$.

Exercise 3.23: Generalize the median-finding algorithm for the case where the input
$S$ is a multi-set. Prove that your resulting algorithm is correct, and bound its running
time.


Exercise 3.24: Generalize the median-finding algorithm to find the kth largest item in 
a set of n items for any given value of k. Prove that your resulting algorithm is correct, 
and bound its running time. 

Exercise 3.25: The weak law of large numbers states that, if $X_1, X_2, X_3, \ldots$ are
independent and identically distributed random variables with mean $\mu$ and standard
deviation $\sigma$, then for any constant $\varepsilon > 0$ we have

$$\lim_{n \to \infty} \Pr\left(\left|\frac{X_1 + X_2 + \cdots + X_n}{n} - \mu\right| > \varepsilon\right) = 0.$$

Use Chebyshev's inequality to prove the weak law of large numbers.

Chernoff Bounds


This chapter introduces what are commonly called Chernoff bounds. Chernoff bounds
are extremely powerful, giving exponentially decreasing bounds on the tail
distribution. These bounds are derived by using Markov's inequality on the moment
generating function of a random variable. We start this chapter by defining and discussing
the properties of the moment generating function. We then derive Chernoff bounds for
the binomial distribution and other related distributions, using a set-balancing
problem as an example. To demonstrate the power of Chernoff bounds, we apply them
to the analysis of randomized packet routing schemes on the hypercube and butterfly
networks.


4.1. Moment Generating Functions 

Before developing Chernoff bounds, we discuss the special role of the moment
generating function $\mathrm{E}[e^{tX}]$.

Definition 4.1: The moment generating function of a random variable $X$ is

$$M_X(t) = \mathrm{E}[e^{tX}].$$

We are mainly interested in the existence and properties of this function in the
neighborhood of zero.

The function $M_X(t)$ captures all of the moments of $X$.

Theorem 4.1: Let $X$ be a random variable with moment generating function $M_X(t)$.
Under the assumption that exchanging the expectation and differentiation operands is
legitimate, for all $n \ge 1$ we then have

$$\mathrm{E}[X^n] = M_X^{(n)}(0),$$

where $M_X^{(n)}(0)$ is the $n$th derivative of $M_X(t)$ evaluated at $t = 0$.

Proof: Assuming that we can exchange the expectation and differentiation operands,
then

$$M_X^{(n)}(t) = \mathrm{E}[X^n e^{tX}].$$

Computed at $t = 0$, this expression yields

$$M_X^{(n)}(0) = \mathrm{E}[X^n].$$ ■

The assumption that expectation and differentiation operands can be exchanged holds
whenever the moment generating function exists in a neighborhood of zero, which will
be the case for all distributions considered in this book.

As a specific example, consider a geometric random variable $X$ with parameter $p$,
as in Definition 2.8. Then, for $t < -\ln(1 - p)$,

$$M_X(t) = \mathrm{E}[e^{tX}] = \sum_{k=1}^{\infty} (1-p)^{k-1} p\, e^{tk} = \frac{p}{1-p}\sum_{k=1}^{\infty} (1-p)^k e^{tk} = \frac{p}{1-p}\left(\left(1 - (1-p)e^t\right)^{-1} - 1\right).$$

It follows that

$$M_X^{(1)}(t) = p\left(1 - (1-p)e^t\right)^{-2} e^t$$

and

$$M_X^{(2)}(t) = 2p(1-p)\left(1 - (1-p)e^t\right)^{-3} e^{2t} + p\left(1 - (1-p)e^t\right)^{-2} e^t.$$

Evaluating these derivatives at $t = 0$ and using Theorem 4.1 gives $\mathrm{E}[X] = 1/p$ and
$\mathrm{E}[X^2] = (2 - p)/p^2$, matching our previous calculations from Section 2.4 and
Section 3.3.1.
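
As a quick numerical sanity check of the geometric example, the sketch below differentiates the closed-form moment generating function at $t = 0$ by central finite differences and compares the results with $1/p$ and $(2-p)/p^2$. The function names and the step size `h` are illustrative choices, and the code assumes $0 < p < 1$.

```python
import math


def geometric_mgf(t, p):
    """Closed form M_X(t) = p/(1-p) * ((1 - (1-p)e^t)^(-1) - 1), valid for t < -ln(1-p)."""
    q = 1.0 - p
    if t >= -math.log(q):
        raise ValueError("the MGF is finite only for t < -ln(1 - p)")
    return (p / q) * (1.0 / (1.0 - q * math.exp(t)) - 1.0)


def moments_from_mgf(p, h=1e-4):
    """Approximate E[X] and E[X^2] by central differences of the MGF at t = 0."""
    m_plus = geometric_mgf(h, p)
    m_zero = geometric_mgf(0.0, p)
    m_minus = geometric_mgf(-h, p)
    first = (m_plus - m_minus) / (2 * h)            # approximates M'(0)
    second = (m_plus - 2 * m_zero + m_minus) / h ** 2  # approximates M''(0)
    return first, second
```

Finite differences only approximate the derivatives, so agreement is up to a small numerical error rather than exact.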

Another useful property is that the moment generating function of a random variable 
(or, equivalently, all of the moments of the variable) uniquely defines its distribution. 
However, the proof of the following theorem is beyond the scope of this book. 

Theorem 4.2: Let $X$ and $Y$ be two random variables. If

$$M_X(t) = M_Y(t)$$

for all $t \in (-\delta, \delta)$ for some $\delta > 0$, then $X$ and $Y$ have the same distribution.

One application of Theorem 4.2 is in determining the distribution of a sum of
independent random variables.

Theorem 4.3: If $X$ and $Y$ are independent random variables, then

$$M_{X+Y}(t) = M_X(t)\,M_Y(t).$$

Proof:

$$M_{X+Y}(t) = \mathrm{E}[e^{t(X+Y)}] = \mathrm{E}[e^{tX}e^{tY}] = \mathrm{E}[e^{tX}]\,\mathrm{E}[e^{tY}] = M_X(t)\,M_Y(t).$$

Here we have used that $X$ and $Y$ are independent - and hence $e^{tX}$ and $e^{tY}$ are
independent - to conclude that $\mathrm{E}[e^{tX}e^{tY}] = \mathrm{E}[e^{tX}]\,\mathrm{E}[e^{tY}]$. ■

Thus, if we know $M_X(t)$ and $M_Y(t)$ and if we recognize the function $M_X(t)\,M_Y(t)$ as
the moment generating function of a known distribution, then that must be the
distribution of $X + Y$ when Theorem 4.2 applies. We will see examples of this in subsequent
sections and in the exercises.


4.2. Deriving and Applying Chernoff Bounds 


The Chernoff bound for a random variable $X$ is obtained by applying Markov's
inequality to $e^{tX}$ for some well-chosen value $t$. From Markov's inequality, we can derive
the following useful inequality: for any $t > 0$,

$$\Pr(X \ge a) = \Pr(e^{tX} \ge e^{ta}) \le \frac{\mathrm{E}[e^{tX}]}{e^{ta}}.$$

In particular,

$$\Pr(X \ge a) \le \min_{t > 0} \frac{\mathrm{E}[e^{tX}]}{e^{ta}}.$$

Similarly, for any $t < 0$,

$$\Pr(X \le a) = \Pr(e^{tX} \ge e^{ta}) \le \frac{\mathrm{E}[e^{tX}]}{e^{ta}}.$$

Hence

$$\Pr(X \le a) \le \min_{t < 0} \frac{\mathrm{E}[e^{tX}]}{e^{ta}}.$$

Bounds for specific distributions are obtained by choosing appropriate values for $t$.
While the value of $t$ that minimizes $\mathrm{E}[e^{tX}]/e^{ta}$ gives the best possible bounds, often
one chooses a value of $t$ that gives a convenient form. Bounds derived from this
approach are generally referred to collectively as Chernoff bounds. When we speak of a
Chernoff bound for a random variable, it could actually be one of many bounds derived
in this fashion.
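
To make the role of $t$ concrete, the following sketch minimizes $\mathrm{E}[e^{tX}]/e^{ta}$ numerically over a grid of $t > 0$, given the logarithm of the moment generating function as a callable. The grid search, the parameter names (`t_max`, `steps`), and the fair-coin example are illustrative assumptions; an analytic choice of $t$, as in the derivations in this chapter, is usually preferable.

```python
import math


def chernoff_upper_tail(log_mgf, a, t_max=10.0, steps=10_000):
    """Grid approximation of min over t in (0, t_max] of E[e^{tX}] / e^{ta}.

    log_mgf(t) must return ln E[e^{tX}]. Working with logarithms avoids
    overflow; the result is capped at the trivial bound 1.
    """
    best_log = 0.0
    for j in range(1, steps + 1):
        t = t_max * j / steps
        best_log = min(best_log, log_mgf(t) - t * a)
    return math.exp(best_log)


def fair_coin_log_mgf(n):
    """ln E[e^{tX}] for X = number of heads in n fair flips: n * ln((1 + e^t) / 2)."""
    return lambda t: n * (math.log1p(math.exp(t)) - math.log(2.0))
```

Every grid point gives a valid upper bound on $\Pr(X \ge a)$, so a coarse grid loses tightness but never correctness.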


4.2.1. Chernoff Bounds for the Sum of Poisson Trials 

We now develop the most commonly used version of the Chernoff bound: for the tail
distribution of a sum of independent 0-1 random variables, which are also known as
Poisson trials. (Poisson trials differ from Poisson random variables, which will be
discussed in Section 5.3.) The distributions of the random variables in Poisson trials are
not necessarily identical. Bernoulli trials are a special case of Poisson trials where the
independent 0-1 random variables have the same distribution; in other words, all trials
are Poisson trials that take on the value 1 with the same probability. Also
recall that the binomial distribution gives the number of successes in $n$ independent
Bernoulli trials. Our Chernoff bound will hold for the binomial distribution and also
for the more general setting of the sum of Poisson trials.

Let $X_1, \ldots, X_n$ be a sequence of independent Poisson trials with $\Pr(X_i = 1) = p_i$.
Let $X = \sum_{i=1}^{n} X_i$, and let

$$\mu = \mathrm{E}[X] = \mathrm{E}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathrm{E}[X_i] = \sum_{i=1}^{n} p_i.$$

For a given $\delta > 0$, we are interested in bounds on $\Pr(X \ge (1 + \delta)\mu)$ and
$\Pr(X \le (1 - \delta)\mu)$ - that is, the probability that $X$ deviates from its expectation
$\mu$ by $\delta\mu$ or more. To develop a Chernoff bound we need to compute the moment
generating function of $X$. We start with the moment generating function of each $X_i$:

$$M_{X_i}(t) = \mathrm{E}[e^{tX_i}] = p_i e^t + (1 - p_i) = 1 + p_i(e^t - 1) \le e^{p_i(e^t - 1)},$$

where in the last inequality we have used the fact that, for any $y$, $1 + y \le e^y$. Applying
Theorem 4.3, we take the product of the $n$ generating functions to obtain

$$M_X(t) = \prod_{i=1}^{n} M_{X_i}(t) \le \prod_{i=1}^{n} e^{p_i(e^t - 1)} = \exp\left\{\sum_{i=1}^{n} p_i(e^t - 1)\right\} = e^{(e^t - 1)\mu}.$$


Now that we have determined a bound on the moment generating function, we are 
ready to develop concrete versions of the Chernoff bound for a sum of Poisson trials. 
We start with bounds on the deviation above the mean. 


Theorem 4.4: Let $X_1, \ldots, X_n$ be independent Poisson trials such that $\Pr(X_i = 1) = p_i$.
Let $X = \sum_{i=1}^{n} X_i$ and $\mu = \mathrm{E}[X]$. Then the following Chernoff bounds hold:

1. for any $\delta > 0$,

$$\Pr(X \ge (1 + \delta)\mu) \le \left(\frac{e^{\delta}}{(1 + \delta)^{(1+\delta)}}\right)^{\mu}; \qquad (4.1)$$

2. for $0 < \delta \le 1$,

$$\Pr(X \ge (1 + \delta)\mu) \le e^{-\mu\delta^2/3}; \qquad (4.2)$$

3. for $R \ge 6\mu$,

$$\Pr(X \ge R) \le 2^{-R}. \qquad (4.3)$$

The first bound of the theorem is the strongest, and it is from this bound that we derive
the other two bounds, which have the advantage of being easier to state and compute
with in many situations.

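Proof: Applying Markov's inequality, for any $t > 0$ we have

$$\Pr(X \ge (1 + \delta)\mu) = \Pr\left(e^{tX} \ge e^{t(1+\delta)\mu}\right) \le \frac{\mathrm{E}[e^{tX}]}{e^{t(1+\delta)\mu}} \le \frac{e^{(e^t - 1)\mu}}{e^{t(1+\delta)\mu}}.$$

For any $\delta > 0$, we can set $t = \ln(1 + \delta) > 0$ to get (4.1):

$$\Pr(X \ge (1 + \delta)\mu) \le \left(\frac{e^{\delta}}{(1 + \delta)^{(1+\delta)}}\right)^{\mu}.$$

To obtain (4.2) we need to show that, for $0 < \delta \le 1$,

$$\frac{e^{\delta}}{(1 + \delta)^{(1+\delta)}} \le e^{-\delta^2/3}.$$

Taking the logarithm of both sides, we obtain the equivalent condition

$$f(\delta) = \delta - (1 + \delta)\ln(1 + \delta) + \frac{\delta^2}{3} \le 0.$$

Computing the derivatives of $f(\delta)$, we have:

$$f'(\delta) = 1 - \frac{1 + \delta}{1 + \delta} - \ln(1 + \delta) + \frac{2}{3}\delta = -\ln(1 + \delta) + \frac{2}{3}\delta,$$

$$f''(\delta) = -\frac{1}{1 + \delta} + \frac{2}{3}.$$

We see that $f''(\delta) < 0$ for $0 \le \delta < 1/2$ and that $f''(\delta) > 0$ for $\delta > 1/2$. Hence
$f'(\delta)$ first decreases and then increases over the interval $[0, 1]$. Since $f'(0) = 0$ and
$f'(1) < 0$, we can conclude that $f'(\delta) \le 0$ in the interval $[0, 1]$. Since $f(0) = 0$, it
follows that $f(\delta) \le 0$ in that interval, proving (4.2).

To prove (4.3), let $R = (1 + \delta)\mu$. Then, for $R \ge 6\mu$, $\delta = R/\mu - 1 \ge 5$. Hence,
using (4.1),

$$\Pr(X \ge (1 + \delta)\mu) \le \left(\frac{e^{\delta}}{(1 + \delta)^{(1+\delta)}}\right)^{\mu} \le \left(\frac{e}{1 + \delta}\right)^{(1+\delta)\mu} \le \left(\frac{e}{6}\right)^{R} \le 2^{-R}.$$ ■

We obtain similar results bounding the deviation below the mean.

Theorem 4.5: Let $X_1, \ldots, X_n$ be independent Poisson trials such that $\Pr(X_i = 1) = p_i$.
Let $X = \sum_{i=1}^{n} X_i$ and $\mu = \mathrm{E}[X]$. Then, for $0 < \delta < 1$:

1. $$\Pr(X \le (1 - \delta)\mu) \le \left(\frac{e^{-\delta}}{(1 - \delta)^{(1-\delta)}}\right)^{\mu}; \qquad (4.4)$$

2. $$\Pr(X \le (1 - \delta)\mu) \le e^{-\mu\delta^2/2}. \qquad (4.5)$$


Again, the bound of Eqn. (4.4) is stronger than Eqn. (4.5), but the latter is generally
easier to use and sufficient in most applications.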

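Proof: Using Markov's inequality, for any $t < 0$ we have

$$\Pr(X \le (1 - \delta)\mu) = \Pr\left(e^{tX} \ge e^{t(1-\delta)\mu}\right) \le \frac{\mathrm{E}[e^{tX}]}{e^{t(1-\delta)\mu}} \le \frac{e^{(e^t - 1)\mu}}{e^{t(1-\delta)\mu}}.$$

For $0 < \delta < 1$, we set $t = \ln(1 - \delta) < 0$ to get (4.4):

$$\Pr(X \le (1 - \delta)\mu) \le \left(\frac{e^{-\delta}}{(1 - \delta)^{(1-\delta)}}\right)^{\mu}.$$

To prove (4.5) we must show that, for $0 < \delta < 1$,

$$\frac{e^{-\delta}}{(1 - \delta)^{(1-\delta)}} \le e^{-\delta^2/2}.$$

Taking the logarithm of both sides, we obtain the equivalent condition

$$f(\delta) = -\delta - (1 - \delta)\ln(1 - \delta) + \frac{\delta^2}{2} \le 0$$

for $0 < \delta < 1$.

Differentiating $f(\delta)$ yields

$$f'(\delta) = \ln(1 - \delta) + \delta,$$

$$f''(\delta) = -\frac{1}{1 - \delta} + 1.$$

Since $f''(\delta) < 0$ in the range $(0, 1)$ and since $f'(0) = 0$, we have $f'(\delta) \le 0$ in the
range $[0, 1)$. Therefore, $f(\delta)$ is nonincreasing in that interval. Since $f(0) = 0$, it
follows that $f(\delta) \le 0$ when $0 < \delta < 1$, as required. ■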


Often the following form of the Chernoff bound, which is derived immediately from
Eqn. (4.2) and Eqn. (4.5), is used.

Corollary 4.6: Let $X_1, \ldots, X_n$ be independent Poisson trials such that $\Pr(X_i = 1) = p_i$.
Let $X = \sum_{i=1}^{n} X_i$ and $\mu = \mathrm{E}[X]$. For $0 < \delta < 1$,

$$\Pr(|X - \mu| \ge \delta\mu) \le 2e^{-\mu\delta^2/3}. \qquad (4.6)$$


In practice we often do not have the exact value of $\mathrm{E}[X]$. Instead we can use
$\mu \ge \mathrm{E}[X]$ in Theorem 4.4 and $\mu \le \mathrm{E}[X]$ in Theorem 4.5 (see Exercise 4.7).
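
The relationship between the sharper bounds (4.1), (4.4) and their simpler forms (4.2), (4.5) can be checked numerically. The sketch below evaluates each bound through its exponent; the function names are illustrative, and the lower-tail functions assume $0 < \delta < 1$ so that $\ln(1 - \delta)$ is defined.

```python
import math


def upper_tail_4_1(mu, delta):
    """(e^delta / (1+delta)^(1+delta))^mu, computed via its logarithm."""
    return math.exp(mu * (delta - (1 + delta) * math.log1p(delta)))


def upper_tail_4_2(mu, delta):
    """e^(-mu delta^2 / 3), stated for 0 < delta <= 1."""
    return math.exp(-mu * delta ** 2 / 3)


def lower_tail_4_4(mu, delta):
    """(e^(-delta) / (1-delta)^(1-delta))^mu, for 0 < delta < 1."""
    return math.exp(mu * (-delta - (1 - delta) * math.log1p(-delta)))


def lower_tail_4_5(mu, delta):
    """e^(-mu delta^2 / 2), for 0 < delta < 1."""
    return math.exp(-mu * delta ** 2 / 2)


def two_sided_4_6(mu, delta):
    """2 e^(-mu delta^2 / 3), the combined form of Corollary 4.6."""
    return 2 * math.exp(-mu * delta ** 2 / 3)
```

For any $\mu > 0$ and $0 < \delta < 1$, the sum of the two one-sided simple bounds is at most the value returned by `two_sided_4_6`, which is how the corollary follows.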

4.2.2. Example: Coin Flips 


Let $X$ be the number of heads in a sequence of $n$ independent fair coin flips. Applying
the Chernoff bound of Eqn. (4.6), we have

$$\Pr\left(\left|X - \frac{n}{2}\right| \ge \frac{1}{2}\sqrt{6n\ln n}\right) \le 2\exp\left\{-\frac{1}{3}\cdot\frac{n}{2}\cdot\frac{6\ln n}{n}\right\} = \frac{2}{n}.$$

This demonstrates that the concentration of the number of heads around the mean
$n/2$ is very tight; most of the time, the deviations from the mean are on the order of
$O(\sqrt{n\ln n})$.

To compare the power of this bound to Chebyshev's bound, consider the probability
of having no more than $n/4$ heads or no fewer than $3n/4$ heads in a sequence of $n$
independent fair coin flips. In the previous chapter, we used Chebyshev's inequality to
show that

$$\Pr\left(\left|X - \frac{n}{2}\right| \ge \frac{n}{4}\right) \le \frac{4}{n}.$$

Already, this bound is worse than the Chernoff bound just calculated for a significantly
larger event! Using the Chernoff bound in this case, we find that

$$\Pr\left(\left|X - \frac{n}{2}\right| \ge \frac{n}{4}\right) \le 2\exp\left\{-\frac{1}{3}\cdot\frac{n}{2}\cdot\frac{1}{4}\right\} \le 2e^{-n/24}.$$

Thus, Chernoff's technique gives a bound that is exponentially smaller than the bound
obtained using Chebyshev's inequality.
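
The comparison can be made concrete for specific values of $n$. The sketch below computes the exact probability of the event $|X - n/2| \ge n/4$ by summing binomial coefficients and sets it beside the two bounds from the text; the function names are illustrative.

```python
import math


def exact_two_sided_tail(n):
    """Pr(|X - n/2| >= n/4) for X ~ Binomial(n, 1/2), by direct summation."""
    total = sum(math.comb(n, k) for k in range(n + 1) if 2 * abs(2 * k - n) >= n)
    return total / 2 ** n


def chebyshev_bound(n):
    """The Chebyshev bound 4/n quoted in the text."""
    return 4 / n


def chernoff_bound(n):
    """The Chernoff bound 2 e^(-n/24) derived from Eqn. (4.6)."""
    return 2 * math.exp(-n / 24)
```

For small $n$ the Chebyshev bound is actually the smaller of the two (for example at $n = 24$); the exponential advantage of the Chernoff bound appears as $n$ grows.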


4.2.3. Application: Estimating a Parameter 


Suppose that we are interested in evaluating the probability that a particular gene
mutation occurs in the population. Given a DNA sample, a lab test can determine if it carries
the mutation. However, the test is expensive and we would like to obtain a relatively
reliable estimate from a small number of samples.

Let $p$ be the unknown value that we are trying to estimate. Assume that we have
$n$ samples and that $X = \tilde{p}n$ of these samples have the mutation. Given a sufficiently
large number of samples, we expect the value $p$ to be close to the sampled value $\tilde{p}$. We
express this intuition using the concept of a confidence interval.

Definition 4.2: A $1 - \gamma$ confidence interval for a parameter $p$ is an interval
$[\tilde{p} - \delta, \tilde{p} + \delta]$ such that

$$\Pr(p \in [\tilde{p} - \delta, \tilde{p} + \delta]) \ge 1 - \gamma.$$

Notice that, instead of predicting a single value for the parameter, we give an interval
that is likely to contain the parameter. If $p$ can take on any real value, it may not make
sense to try to pin down its exact value from a finite sample, but it does make sense to
estimate it within some small range.

Naturally we want both the interval size $2\delta$ and the error probability $\gamma$ to be as small
as possible. We derive a trade-off between these two parameters and the number of
samples $n$. In particular, given that among $n$ samples (chosen uniformly at random
from the entire population) we find the mutation in exactly $X = \tilde{p}n$ samples, we need
to find values of $\delta$ and $\gamma$ for which

$$\Pr(p \in [\tilde{p} - \delta, \tilde{p} + \delta]) = \Pr(np \in [n(\tilde{p} - \delta), n(\tilde{p} + \delta)]) \ge 1 - \gamma.$$

Now $X = n\tilde{p}$ has a binomial distribution with parameters $n$ and $p$, so $\mathrm{E}[X] = np$.
If $p \notin [\tilde{p} - \delta, \tilde{p} + \delta]$ then we have one of the following two events:

1. if $p < \tilde{p} - \delta$, then $X = n\tilde{p} > n(p + \delta) = \mathrm{E}[X](1 + \delta/p)$;

2. if $p > \tilde{p} + \delta$, then $X = n\tilde{p} < n(p - \delta) = \mathrm{E}[X](1 - \delta/p)$.

We can apply the Chernoff bounds (4.2) and (4.5) to compute

$$\Pr(p \notin [\tilde{p} - \delta, \tilde{p} + \delta]) = \Pr\left(X < np\left(1 - \frac{\delta}{p}\right)\right) + \Pr\left(X > np\left(1 + \frac{\delta}{p}\right)\right) \qquad (4.7)$$

$$< e^{-np(\delta/p)^2/2} + e^{-np(\delta/p)^2/3} \qquad (4.8)$$

$$= e^{-n\delta^2/2p} + e^{-n\delta^2/3p}. \qquad (4.9)$$

The bound given in Eqn. (4.9) is not useful because the value of $p$ is unknown. A
simple solution is to use the fact that $p \le 1$, yielding

$$\Pr(p \notin [\tilde{p} - \delta, \tilde{p} + \delta]) < e^{-n\delta^2/2} + e^{-n\delta^2/3}.$$

Setting $\gamma = e^{-n\delta^2/2} + e^{-n\delta^2/3}$, we obtain a trade-off between $\delta$, $n$, and the
error probability $\gamma$.
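
The trade-off can be turned into a small calculator: given a half-width $\delta$ and a target error probability $\gamma$, find the smallest $n$ for which $e^{-n\delta^2/2} + e^{-n\delta^2/3} \le \gamma$. The sketch below is illustrative only; the function names and the doubling-plus-bisection search are assumptions, and it expects $0 < \gamma < 1$ and $\delta > 0$.

```python
import math


def error_bound(n, delta):
    """The bound e^(-n delta^2 / 2) + e^(-n delta^2 / 3) on the error probability."""
    return math.exp(-n * delta ** 2 / 2) + math.exp(-n * delta ** 2 / 3)


def samples_needed(delta, gamma):
    """Smallest n with error_bound(n, delta) <= gamma.

    The bound decreases in n, so we double n until it is small enough
    and then bisect between the last failing and first passing values.
    """
    hi = 1
    while error_bound(hi, delta) > gamma:
        hi *= 2
    lo = hi // 2          # either 0 or a value where the bound still exceeds gamma
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if error_bound(mid, delta) <= gamma:
            hi = mid
        else:
            lo = mid
    return hi
```

Because this uses the pessimistic substitution $p \le 1$, the resulting sample size is conservative.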

We can apply other Chernoff bounds, such as those in Exercises 4.13 and 4.16, to
obtain better bounds. We return to the subject of parameter estimation when we discuss
the Monte Carlo method in Chapter 10.

4.3. Better Bounds for Some Special Cases

We can obtain stronger bounds using a simpler proof technique for some special cases
of symmetric random variables.

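We consider first the sum of independent random variables when each variable
assumes the value 1 or $-1$ with equal probability.


Theorem 4.7: Let $X_1, \ldots, X_n$ be independent random variables with

$$\Pr(X_i = 1) = \Pr(X_i = -1) = \frac{1}{2}.$$

Let $X = \sum_{i=1}^{n} X_i$. For any $a > 0$,

$$\Pr(X \ge a) \le e^{-a^2/2n}.$$

Proof: For any $t > 0$,

$$\mathrm{E}[e^{tX_i}] = \frac{1}{2}e^{t} + \frac{1}{2}e^{-t}.$$

To estimate $\mathrm{E}[e^{tX_i}]$, we observe that

$$e^{t} = 1 + t + \frac{t^2}{2!} + \cdots + \frac{t^i}{i!} + \cdots$$

and

$$e^{-t} = 1 - t + \frac{t^2}{2!} + \cdots + (-1)^i\frac{t^i}{i!} + \cdots,$$

using the Taylor series expansion for $e^t$. Thus,

$$\mathrm{E}[e^{tX_i}] = \frac{1}{2}e^{t} + \frac{1}{2}e^{-t} = \sum_{i \ge 0} \frac{t^{2i}}{(2i)!} \le \sum_{i \ge 0} \frac{(t^2/2)^i}{i!} = e^{t^2/2}.$$

Using this estimate yields

$$\mathrm{E}[e^{tX}] = \prod_{i=1}^{n} \mathrm{E}[e^{tX_i}] \le e^{t^2 n/2}$$

and

$$\Pr(X \ge a) = \Pr(e^{tX} \ge e^{ta}) \le \frac{\mathrm{E}[e^{tX}]}{e^{ta}} \le e^{t^2 n/2 - ta}.$$

Setting $t = a/n$, we obtain

$$\Pr(X \ge a) \le e^{-a^2/2n}.$$ ■

By symmetry we also have

$$\Pr(X \le -a) \le e^{-a^2/2n}.$$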

Combining the two results yields our next corollary.

Corollary 4.8: Let $X_1, \ldots, X_n$ be independent random variables with

$$\Pr(X_i = 1) = \Pr(X_i = -1) = \frac{1}{2}.$$

Let $X = \sum_{i=1}^{n} X_i$. Then, for any $a > 0$,

$$\Pr(|X| \ge a) \le 2e^{-a^2/2n}.$$

Applying the transformation $Y_i = (X_i + 1)/2$ allows us to prove the following.

Corollary 4.9: Let $Y_1, \ldots, Y_n$ be independent random variables with

$$\Pr(Y_i = 1) = \Pr(Y_i = 0) = \frac{1}{2}.$$

Let $Y = \sum_{i=1}^{n} Y_i$ and $\mu = \mathrm{E}[Y] = n/2$.

1. For any $a > 0$,

$$\Pr(Y \ge \mu + a) \le e^{-2a^2/n}.$$

2. For any $\delta > 0$,

$$\Pr(Y \ge (1 + \delta)\mu) \le e^{-\delta^2\mu}. \qquad (4.10)$$

Proof: Using the notation of Theorem 4.7, we have

$$Y = \frac{1}{2}\sum_{i=1}^{n} X_i + \frac{n}{2} = \frac{1}{2}X + \mu.$$

Applying Theorem 4.7 yields

$$\Pr(Y \ge \mu + a) = \Pr(X \ge 2a) \le e^{-4a^2/2n},$$

proving the first part of the corollary. The second part follows from setting
$a = \delta\mu = \delta n/2$. Again applying Theorem 4.7, we have

$$\Pr(Y \ge (1 + \delta)\mu) = \Pr(X \ge 2\delta\mu) \le e^{-2\delta^2\mu^2/n} = e^{-\delta^2\mu}.$$ ■

Note that the constant in the exponent of the bound (4.10) is 1 instead of the 1/3 in the
bound of (4.2).

Similarly, we have the following result.

Corollary 4.10: Let $Y_1, \ldots, Y_n$ be independent random variables with

$$\Pr(Y_i = 1) = \Pr(Y_i = 0) = \frac{1}{2}.$$

Let $Y = \sum_{i=1}^{n} Y_i$ and $\mu = \mathrm{E}[Y] = n/2$.

1. For any $0 < a < \mu$,

$$\Pr(Y \le \mu - a) \le e^{-2a^2/n}.$$

2. For any $0 < \delta < 1$,

$$\Pr(Y \le (1 - \delta)\mu) \le e^{-\delta^2\mu}. \qquad (4.11)$$

4.4. Application: Set Balancing 

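Given an $n \times m$ matrix $A$ with entries in $\{0, 1\}$, let

$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix}.$$

Suppose that we are looking for a vector $\bar{b}$ with entries in $\{-1, 1\}$ that minimizes

$$\|A\bar{b}\|_{\infty} = \max_{i = 1, \ldots, n} |c_i|.$$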

This problem arises in designing statistical experiments. Each column of the matrix $A$
represents a subject in the experiment and each row represents a feature. The vector $\bar{b}$
partitions the subjects into two disjoint groups, so that each feature is roughly as
balanced as possible between the two groups. One of the groups serves as a control group
for an experiment that is run on the other group.

Our randomized algorithm for computing a vector $\bar{b}$ is extremely simple. We
randomly choose the entries of $\bar{b}$, with $\Pr(b_i = 1) = \Pr(b_i = -1) = 1/2$. The choices
for different entries are independent. Surprisingly, although this algorithm ignores the
entries of the matrix $A$, the following theorem shows that $\|A\bar{b}\|_{\infty}$ is likely to be only
$O(\sqrt{m\ln n})$. This bound is fairly tight. In Exercise 4.15 you are asked to show that,
when $m = n$, there exists a matrix $A$ for which $\|A\bar{b}\|_{\infty}$ is $\Omega(\sqrt{n})$ for any choice of $\bar{b}$.

Theorem 4.11: For a random vector $\bar{b}$ with entries chosen independently and with
equal probability from the set $\{-1, 1\}$,

$$\Pr\left(\|A\bar{b}\|_{\infty} \ge \sqrt{4m\ln n}\right) \le \frac{2}{n}.$$

Proof: Consider the $i$th row $\bar{a}_i = a_{i,1}, \ldots, a_{i,m}$, and let $k$ be the number of 1s in that
row. If $k \le \sqrt{4m\ln n}$, then clearly $|\bar{a}_i \cdot \bar{b}| = |c_i| \le \sqrt{4m\ln n}$. On the other hand, if
$k > \sqrt{4m\ln n}$ then we note that the $k$ nonzero terms in the sum

$$Z_i = \sum_{j=1}^{m} a_{i,j} b_j$$

are independent random variables, each with probability 1/2 of being either $+1$ or $-1$.
Now using the Chernoff bound of Corollary 4.8 and the fact that $m \ge k$,

$$\Pr\left(|Z_i| > \sqrt{4m\ln n}\right) \le 2e^{-4m\ln n/2k} \le \frac{2}{n^2}.$$

By the union bound, the probability that the bound fails for any row is at most $2/n$. ■
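
The algorithm analyzed in Theorem 4.11 is short enough to simulate directly. The sketch below draws a random $\pm 1$ vector, computes $\|A\bar{b}\|_{\infty}$, and lets one compare the result with the threshold $\sqrt{4m\ln n}$. The representation of $A$ as a list of rows and the function names are illustrative assumptions.

```python
import math
import random


def max_imbalance(a, b):
    """Infinity norm of A b: the largest |c_i| over the rows of A."""
    return max(abs(sum(a_ij * b_j for a_ij, b_j in zip(row, b))) for row in a)


def random_balance(a, rng):
    """Pick each entry of b independently and uniformly from {-1, +1}.

    Returns the vector b together with its imbalance. The entries of A
    are never inspected when choosing b, as in the text.
    """
    m = len(a[0])
    b = [rng.choice((-1, 1)) for _ in range(m)]
    return b, max_imbalance(a, b)


def threshold(n, m):
    """The bound sqrt(4 m ln n) from Theorem 4.11."""
    return math.sqrt(4 * m * math.log(n))
```

In experiments with random 0-1 matrices the observed imbalance is typically well below the threshold, consistent with the $2/n$ failure bound being an upper bound rather than an estimate.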


4.5.* Application: Packet Routing in Sparse Networks 


A fundamental problem in parallel computing is how to communicate efficiently over
sparse communication networks. We model a communication network by a directed
graph on $N$ nodes. Each node is a routing switch. A directed edge models a
communication channel, which connects two adjacent routing switches. We consider a
synchronous computing model in which (a) an edge can carry one packet in each time
step and (b) a packet can traverse no more than one edge per step. We assume that
switches have buffers or queues to store packets waiting for transmission through each
of the switch's outgoing edges.

Given a network topology, a routing algorithm specifies, for each pair of nodes, a 
route - or a sequence of edges - connecting the pair in the network. The algorithm 
may also specify a queuing policy for ordering packets in the switches’ queues. For 
example, the First In First Out (FIFO) policy orders packets by their order of arrival. 
The Furthest To Go (FTG) policy orders packets in decreasing order of the number of 
edges they must still cross in the network. 

Our measure of the performance of a routing algorithm on a given network topology 
is the maximum time - measured as the number of parallel steps - required to route an 
arbitrary permutation routing problem, where each node sends exactly one packet and 
each node is the address of exactly one packet. 

Of course, routing a permutation can be done in just one parallel step if the network
is a complete graph connecting all of the nodes to each other. Practical considerations,
however, dictate that a network for a large-scale parallel machine must be sparse. Each
node can be connected directly to only a few neighbors, and most packets must traverse
intermediate nodes en route to their final destination. Since an edge may be on the path
of more than one packet and since each edge can process only one packet per step,
parallel packet routing on sparse networks may lead to congestion and bottlenecks. The
practical problem of designing an efficient communication scheme for parallel
computers leads to an interesting combinatorial and algorithmic problem: designing a family
of sparse networks connecting any number of processors, together with a routing
algorithm that routes an arbitrary permutation request in a small number of parallel steps.

We discuss here a simple and elegant randomized routing technique and then use
Chernoff bounds to analyze its performance on the hypercube network and the butterfly
network. We first analyze the case of routing a permutation on a hypercube, a network
with $N$ processors and $O(N\log N)$ edges. We then present a tighter argument for the
butterfly network, which has $N$ nodes and only $O(N)$ edges.


4.5.1. Permutation Routing on the Hypercube 

Let $\mathcal{N} = \{0 \le i \le N - 1\}$ be the set of processors in our parallel machine and
assume that $N = 2^n$ for some integer $n$. Let $\bar{x} = (x_1, \ldots, x_n)$ be the binary
representation of the number $0 \le x \le N - 1$.

Definition 4.3: The $n$-dimensional hypercube (or $n$-cube) is a network with $N = 2^n$
nodes such that node $x$ has a direct connection to node $y$ if and only if $\bar{x}$ and $\bar{y}$ differ in
exactly one bit.

See Figure 4.1. Note that the total number of directed edges in the $n$-cube is $2nN$, since
each node is adjacent to $n$ outgoing and $n$ ingoing edges. Also, the diameter of the
network is $n$; that is, there is a directed path of length up to $n$ connecting any two nodes
in the network, and there are pairs of nodes that are not connected by any shorter path.

The topology of the hypercube allows for a simple bit-fixing routing mechanism, as
shown in Algorithm 4.1. When determining which edge to cross next, the algorithm
simply considers each bit in order and crosses the edge if necessary.

n-Cube Bit-Fixing Routing Algorithm:

1. Let $\bar{a}$ and $\bar{b}$ be the origin and the destination of the packet.

2. For $i = 1$ to $n$, do:

(a) If $a_i \ne b_i$ then traverse the edge
$(b_1, \ldots, b_{i-1}, a_i, \ldots, a_n) \to (b_1, \ldots, b_{i-1}, b_i, a_{i+1}, \ldots, a_n)$.


Algorithm 4.1: n-Cube bit-fixing routing algorithm.


Although it seems quite natural, using only the bit-fixing routes can lead to high
levels of congestion and poor performance, as shown in Exercise 4.20. There are
certain permutations on which the bit-fixing routes behave poorly. It turns out, as we will
show, that these routes perform well if each packet is being sent from a source to a
destination chosen uniformly at random. This motivates the following approach: first
route each packet to a randomly chosen intermediate point, and then route it from this
intermediate point to its final destination.
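
The bit-fixing rule and the two-phase idea can be sketched for a single packet as follows. Nodes are represented as tuples of bits, and the function names are illustrative; the sketch computes only the sequence of nodes visited and does not model queues, time steps, or contention between packets.

```python
import random


def bit_fixing_route(a, b):
    """Nodes visited when routing from a to b, fixing differing bits left to right."""
    path = [tuple(a)]
    cur = list(a)
    for i in range(len(cur)):
        if cur[i] != b[i]:
            cur[i] = b[i]              # cross the edge along dimension i
            path.append(tuple(cur))
    return path


def two_phase_route(a, b, rng):
    """Route a -> random intermediate node -> b, using bit-fixing in both phases."""
    mid = tuple(rng.randint(0, 1) for _ in a)   # each bit is 0 or 1 with probability 1/2
    first = bit_fixing_route(a, mid)
    second = bit_fixing_route(mid, b)
    return first + second[1:]                   # drop the duplicated intermediate node
```

Each phase crosses at most $n$ edges, so a two-phase path has length at most $2n$ before any queueing delay is taken into account.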

It may seem unusual to first route packets to a random intermediate point. In some
sense, this is similar in spirit to our analysis of Quicksort in Section 2.5. We found there
that for a list already sorted in reverse order, Quicksort would take $\Omega(n^2)$ comparisons,
whereas the expected number of comparisons for a randomly chosen permutation is
only $O(n\log n)$. Randomizing the data can lead to a better running time for Quicksort.
Here, too, randomizing the routes that packets take - by routing them through a
random intermediate point - avoids bad initial permutations and leads to good expected
performance.

The two-phase routing algorithm (Algorithm 4.2) is executed in parallel by all the packets. The random choices are made independently for each packet. Our analysis holds for any queueing policy that obeys the following natural requirement: if a queue is not empty at the beginning of a time step, some packet is sent along the edge associated with that queue during that time step. We prove that this routing strategy achieves asymptotically optimal parallel time.
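
To make the setting concrete, the following Python sketch simulates two-phase routing on a small n-cube. It rests on several assumptions that are not in the text: nodes are integers with bit 1 as the most significant bit, each directed edge has a FIFO queue (one policy that satisfies the requirement above), each packet starts Phase II as soon as it finishes Phase I, and all function names are hypothetical.

```python
import random
from collections import defaultdict, deque


def bit_fixing_edges(a, b, n):
    """Directed edges of the bit-fixing route from a to b (bit 1 = most significant)."""
    edges, current = [], a
    for i in range(1, n + 1):
        mask = 1 << (n - i)
        if (current ^ b) & mask:
            edges.append((current, current ^ mask))
            current ^= mask
    return edges


def two_phase_paths(destination, n, rng):
    """Edge list per packet: origin -> random intermediate node -> destination."""
    paths = []
    for origin, dest in enumerate(destination):
        mid = rng.randrange(2 ** n)   # independent uniform choice per packet
        paths.append(bit_fixing_edges(origin, mid, n) + bit_fixing_edges(mid, dest, n))
    return paths


def parallel_steps(paths):
    """Count synchronous steps until all packets arrive; each edge sends one packet per step."""
    queues = defaultdict(deque)
    progress = [0] * len(paths)
    in_flight = 0
    for packet, path in enumerate(paths):
        if path:
            queues[path[0]].append(packet)
            in_flight += 1
    steps = 0
    while in_flight:
        steps += 1
        # Every non-empty queue forwards exactly one packet (FIFO order).
        sent = [q.popleft() for q in queues.values() if q]
        for packet in sent:
            progress[packet] += 1
            if progress[packet] == len(paths[packet]):
                in_flight -= 1
            else:
                queues[paths[packet][progress[packet]]].append(packet)
    return steps


if __name__ == "__main__":
    n = 8
    rng = random.Random(0)
    perm = list(range(2 ** n))
    rng.shuffle(perm)
    steps = parallel_steps(two_phase_paths(perm, n, rng))
    print(steps, "steps; the proof below bounds this by", 60 * n)
```

Packets that arrive at a queue during a step are appended only after all queues have sent, so no packet crosses more than one edge per step.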

Theorem 4.12: Given an arbitrary permutation routing problem, with probability $1 - O(N^{-1})$ the two-phase routing scheme of Algorithm 4.2 routes all packets to their destinations on the n-cube in $O(n) = O(\log N)$ parallel steps.


Proof: We first analyze the run-time of Phase I. To simplify the analysis we assume that no packet starts the execution of Phase II before all packets have finished the execution of Phase I. We show later that this assumption can be removed.

We emphasize a fact that we use implicitly throughout. If a packet is routed to a randomly chosen node x in the network, we can think of $x = (x_1, \ldots, x_n)$ as being generated by setting each $x_i$ independently to be 0 with probability 1/2 and 1 with probability 1/2.

For a given packet M, let $T_1(M)$ be the number of steps for M to finish Phase I. For a given edge e, let $X_1(e)$ denote the total number of packets that traverse edge e during Phase I.

Two-Phase Routing Algorithm:

Phase I - Route the packet to a randomly chosen node in the network using the bit-fixing route.

Phase II - Route the packet from its random location to its final destination using the bit-fixing route.


Algorithm 4.2: Two-phase routing algorithm.

In each step of executing Phase I, packet M is either traversing an edge or waiting in a queue while some other packet traverses an edge on M's route. This simple observation relates the routing time of M to the total number of packet transitions through edges on the path of M, as follows.


Lemma 4.13: Let $e_1, \ldots, e_m$ be the $m \le n$ edges traversed by a packet M in Phase I. Then

$$T_1(M) \le \sum_{i=1}^{m} X_1(e_i).$$

Let us call any path $P = (e_1, e_2, \ldots, e_m)$ of $m \le n$ edges that follows the bit-fixing algorithm a possible packet path. We denote the corresponding nodes by $v_0, v_1, \ldots, v_m$ with $e_i = (v_{i-1}, v_i)$. Following the definition of $T_1(M)$, for any possible packet path P we let

$$T_1(P) = \sum_{i=1}^{m} X_1(e_i).$$

By Lemma 4.13, the probability that Phase I takes more than T steps is bounded by the probability that, for some possible packet path P, $T_1(P) \ge T$. Note that there are at most $2^n \cdot 2^n = 2^{2n}$ possible packet paths, since there are $2^n$ possible origins and $2^n$ possible destinations.

To prove the theorem, we need a high-probability bound on $T_1(P)$. Since $T_1(P)$ equals the summation $\sum_{i=1}^{m} X_1(e_i)$, it would be natural to try to use a Chernoff bound. The difficulty here is that the $X_1(e_i)$ are not independent random variables, since a packet that traverses an edge is likely to traverse one of its adjacent edges. To circumvent this difficulty, we first use a Chernoff bound to prove that, with high probability, no more than 6n different packets cross any edge of P. We then condition on this event to derive a high-probability bound on the total number of transitions these packets make through edges of the path P, again using a Chernoff bound.¹

¹ This approach overestimates the time to finish a phase. In fact, there is a deterministic argument showing that, in this setting, the delay of a packet on a path is bounded by the number of different packets that traverse edges of the path, and hence there is no need to bound the total number of traversals of these packets on the path. However, in the spirit of this book we prefer to present the probabilistic argument.

Let us now fix a specific possible packet path P with m edges. To obtain a high-probability bound on the number of packets that cross an edge of P, let us call a packet active at a node $v_{i-1}$ on the path P if it reaches $v_{i-1}$ and has the possibility of crossing edge $e_i$ to $v_i$. That is, if $v_{i-1}$ and $v_i$ differ in the jth bit then, in order for a packet to be active at $v_{i-1}$, its jth bit cannot have been fixed by the bit-fixing algorithm when it reaches $v_{i-1}$. We may also call a packet active if it is active at some vertex on the path P. We bound the total number of active packets.

For $k = 1, \ldots, N$, let $H_k$ be a 0-1 random variable such that $H_k = 1$ if the packet starting at node k is active and $H_k = 0$ otherwise. Notice that the $H_k$ are independent because (a) each $H_k$ depends only on the choice of the intermediate destination of the packet starting at node k and (b) these choices are independent for all packets. Let $H = \sum_{k=1}^{N} H_k$ be the total number of active packets.

We first bound E[H]. Consider all the active packets at $v_{i-1}$. Assume that $v_{i-1} = (b_1, \ldots, b_{j-1}, a_j, a_{j+1}, \ldots, a_n)$ and $v_i = (b_1, \ldots, b_{j-1}, b_j, a_{j+1}, \ldots, a_n)$. Then only packets that start at one of the addresses $(*, \ldots, *, a_j, \ldots, a_n)$, where * stands for either a 0 or a 1, can reach $v_{i-1}$ before the jth bit is fixed. Similarly, each of these packets actually reaches $v_{i-1}$ only if its random destination is one of the addresses $(b_1, \ldots, b_{j-1}, *, \ldots, *)$. Thus, there are no more than $2^{j-1}$ possible active packets at $v_{i-1}$, and the probability that each of these packets is actually active at $v_{i-1}$ is $2^{-(j-1)}$. Hence the expected number of active packets per vertex is 1 and, since we need only consider the m vertices $v_0, \ldots, v_{m-1}$, it follows by linearity of expectations that

$$E[H] \le m \cdot 1 \le n.$$


Since H is the sum of independent 0-1 random variables, we can apply the Chernoff bound (we use (4.3)) to prove

$$\Pr(H \ge 6n \ge 6E[H]) \le 2^{-6n}.$$

The high-probability bound for H can help us obtain a bound for $T_1(P)$ as follows. Using

$$\Pr(A) = \Pr(A \mid B)\Pr(B) + \Pr(A \mid \bar{B})\Pr(\bar{B}) \le \Pr(B) + \Pr(A \mid \bar{B}),$$

we find for a given possible packet path P that

$$\Pr(T_1(P) \ge 30n) \le \Pr(H \ge 6n) + \Pr(T_1(P) \ge 30n \mid H < 6n) \le 2^{-6n} + \Pr(T_1(P) \ge 30n \mid H < 6n).$$

Hence if we show

$$\Pr(T_1(P) \ge 30n \mid H < 6n) \le 2^{-3n-1},$$

we then have

$$\Pr(T_1(P) \ge 30n) \le 2^{-3n},$$

which proves sufficient for our purposes.
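
For completeness, the arithmetic behind this last implication can be written out; the only fact used is that $6n \ge 3n + 1$ for every integer $n \ge 1$.

```latex
\[
  2^{-6n} + 2^{-3n-1} \;\le\; 2^{-3n-1} + 2^{-3n-1} \;=\; 2^{-3n},
  \qquad \text{since } 2^{-6n} \le 2^{-3n-1} \text{ whenever } n \ge 1 .
\]
```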

We therefore need to bound the conditional probability $\Pr(T_1(P) \ge 30n \mid H < 6n)$. In other words, conditioning on having no more than 6n active packets that might use edges of P, we need a bound on the total number of transitions that these packets take through edges of P.

We first observe that, if a packet leaves the path, it cannot return to that path in this phase of the routing algorithm. Indeed, assume that the active packet was at $v_i$ and that it moved to $w \ne v_{i+1}$. The smallest index bit in which $v_{i+1}$ and w differ cannot be fixed later in this phase, so the route of the packet and the path P cannot meet again in this phase.

Now suppose we have an active packet on our path P at node $v_i$. What is the probability that the packet crosses $e_i$? Let us think of our packet as fixing the bits in the binary representation of its destination one at a time by independent random coin flips. The nodes of the edge $e_i$ differ in one bit (say, the jth bit) in this representation. It is therefore clear that the probability of the packet crossing edge $e_i$ is at most 1/2, since to cross this edge it must choose the appropriate value for the jth bit. (In fact, the probability might be less than 1/2; the packet might cross some other edge before choosing the value of the jth bit.)

To obtain our bound, let us view as a trial each point in the algorithm where an active packet at a node $v_i$ on the path P might cross edge $e_i$. The trial is successful if the packet leaves the path but a failure if the packet stays on the path. Since the packet leaves the path on a successful trial, if there are at most 6n active packets then there can be at most 6n successes. Each trial is successful, independently, with probability at least 1/2. The number of trials is itself a random variable, which we use in our bound of $T_1(P)$.

We claim that the probability that the active packets cross edges of P more than 30n times is less than the probability that a fair coin flipped 36n times comes up heads fewer than 6n times. To see this, think of a coin being flipped for each trial, with heads corresponding to a success. The coin is biased to come up heads with the proper probability for each trial, but this probability is always at least 1/2 and the coins are independent for each trial. Each failure (tails) corresponds to an active packet crossing an edge, but once there have been 6n successes we know there are no more active packets left that can cross an edge of the path. Using a fair coin instead of a coin possibly biased in favor of success can only lessen the probability that the active packets cross edges of P more than 30n times, as can be shown easily by induction (on the number of biased coins).

Letting Z be the number of heads in 36n fair coin flips, we now apply the Chernoff bound (4.5) to prove:

$$\Pr(T_1(P) \ge 30n \mid H < 6n) \le \Pr(Z \le 6n) \le e^{-18n(2/3)^2/2} = e^{-4n} \le 2^{-3n-1}.$$
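
The chain of inequalities above can be checked numerically for small n. The sketch below compares the exact lower tail of a Binomial(36n, 1/2) variable with the Chernoff value $e^{-4n}$ (mean 18n, deviation parameter 2/3) and with the target $2^{-3n-1}$; the helper name `fair_coin_lower_tail` is hypothetical.

```python
from math import comb, exp


def fair_coin_lower_tail(flips, threshold):
    """Exact Pr(Z <= threshold) for Z ~ Binomial(flips, 1/2)."""
    return sum(comb(flips, k) for k in range(threshold + 1)) / 2 ** flips


if __name__ == "__main__":
    for n in range(1, 6):
        exact = fair_coin_lower_tail(36 * n, 6 * n)
        chernoff = exp(-4 * n)            # e^{-mu * delta^2 / 2}, mu = 18n, delta = 2/3
        target = 2.0 ** (-3 * n - 1)
        print(n, exact, chernoff, target)
```

The exact tail is much smaller than the Chernoff value, which shows how much slack the constants 30 and 36 leave in the argument.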


It follows that

$$\Pr(T_1(P) \ge 30n) \le \Pr(H \ge 6n) + \Pr(T_1(P) \ge 30n \mid H < 6n) \le 2^{-3n},$$

as we wanted to show. Because there are at most $2^{2n}$ possible packet paths in the hypercube, the probability that there is any possible packet path for which $T_1(P) \ge 30n$ is bounded by

$$2^{2n} 2^{-3n} = 2^{-n} = O(N^{-1}).$$

This completes the analysis of Phase I. Consider now the execution of Phase II, assuming that all packets completed their Phase I route. In this case, Phase II can be viewed as running Phase I backwards: instead of packets starting at a given origin and going to a random destination, they start at a random origin and end at a given destination. Hence no packet spends more than 30n steps in Phase II with probability $1 - O(N^{-1})$.

In fact, we can remove the assumption that packets begin Phase II only after Phase I has completed. The foregoing argument allows us to conclude that the total number of packet traversals across the edges of any packet path during Phase I and Phase II together is bounded by 60n with probability $1 - O(N^{-1})$. Since a packet can be delayed only by another packet traversing that edge, we find that every packet completes both Phase I and Phase II after 60n steps with probability $1 - O(N^{-1})$ regardless of how the phases interact, concluding the proof of Theorem 4.12. ■

Note that the run-time of the routing algorithm is optimal up to a constant factor, since the diameter of the hypercube is n. However, the network is not fully utilized because 2nN directed edges are used to route just N packets. At any given time, at most a $1/(2n)$ fraction of the edges is actually being used. This issue is addressed in the next section.


4.5.2. Permutation Routing on the Butterfly 


In this section we adapt the result for permutation routing on the hypercube networks to routing on butterfly networks, yielding a significant improvement in network utilization. Specifically, our goal in this section is to route a permutation on a network with N nodes and O(N) edges in $O(\log N)$ parallel time steps. Recall that the hypercube network had N nodes but $\Omega(N \log N)$ edges. Although the argument will be similar in spirit to that for the hypercube network, there is some additional complexity to the argument for the butterfly network.

We work on the wrapped butterfly network, defined as follows. 


Definition 4.4: The wrapped butterfly network has $N = n2^n$ nodes. The nodes are arranged in n columns and $2^n$ rows. A node's address is a pair (x, r), where $1 \le x \le 2^n$ is the row number and $0 \le r \le n - 1$ is the column number of the node. Node (x, r) is connected to node (y, s) if and only if s = r + 1 mod n and either:

1. x = y (the “direct” edge); or

2. x and y differ in precisely the sth bit in their binary representation (the “flip” edge).

See Figure 4.2. To see the relation between the wrapped butterfly and the hypercube, observe that by collapsing the n nodes in each row of the wrapped butterfly into one “super node” we obtain an n-cube network. Using this correspondence, one can easily verify that there is a unique directed path of length n connecting node (x, r) to any other node (w, r) in the same column. This path is obtained by bit fixing: first fixing bits r + 1 to n, then bits 1 to r. See Algorithm 4.3. Our randomized permutation routing algorithm on the butterfly consists of three phases, as shown in Algorithm 4.4.
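
The bit-fixing path just described (Algorithm 4.3) can be sketched as follows. The indexing conventions are assumptions of this sketch rather than part of the text: rows are integers in [0, 2^n) instead of [1, 2^n], bit j is the jth most significant bit, and moving from column c to column (c + 1) mod n fixes bit c + 1. The function name is hypothetical.

```python
def butterfly_bit_fixing_route(x, y, r, n):
    """Nodes visited from (x, r) to (y, r) on the wrapped butterfly.

    Assumptions: rows are integers in [0, 2**n), bit j is the j-th most
    significant bit, and step i fixes bit j = ((i + r) mod n) + 1.
    """
    path = [(x, r)]
    row = x
    for i in range(n):
        j = ((i + r) % n) + 1
        mask = 1 << (n - j)
        if (row ^ y) & mask:      # bits differ: take the flip edge
            row ^= mask
        # otherwise take the direct edge; either way move to column j mod n
        path.append((row, j % n))
    return path


if __name__ == "__main__":
    for node in butterfly_bit_fixing_route(0b101, 0b010, 1, 3):
        print(format(node[0], "03b"), node[1])
```

Every route produced this way has exactly n edges, and starting in column r it fixes bits r + 1 through n before bits 1 through r.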

Wrapped Butterfly Bit-Fixing Routing Algorithm:

1. Let (x, r) and (y, r) be the origin and the destination of a packet.

2. For i = 0 to n - 1, do:

(a) $j = ((i + r) \bmod n) + 1$;

(b) if $a_j = b_j$ then traverse the direct edge to column j mod n, else traverse the flip edge to column j mod n.


Algorithm 4.3: Wrapped butterfly bit-fixing routing algorithm.

Unlike our analysis of the hypercube, our analysis here cannot simply bound the number of active packets that possibly traverse edges of a path. Given the path of a packet, the expected number of other packets that share edges with this path when routing a random permutation on the butterfly network is $\Omega(n^2)$ and not $O(n)$ as in the n-cube. To obtain an $O(n)$ routing time, we need a more refined analysis technique that takes into account the order in which packets traverse edges.

Three-Phase Routing Algorithm:

For a packet sent from node (x, r) to node (y, s):

Phase I - Choose a random $w \in [1, \ldots, 2^n]$. Route the packet from node (x, r) to node (w, r) using the bit-fixing route.

Phase II - Route the packet to node (w, s) using direct edges.

Phase III - Route the packet from node (w, s) to node (y, s) using the bit-fixing route.


Algorithm 4.4: Three-phase routing algorithm.

Because of this, we need to consider the priority policy that the queues use when there are several packets waiting to use the edge. A variety of priority policies would work here; we assume the following rules.

1. The priority of a packet traversing an edge is $(i - 1)n + t$, where i is the current phase of the packet and t is the number of edge traversals the packet has already executed in this phase.

2. If at any step more than one packet is available to traverse an edge, the packet with the smallest priority number is sent first.
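
The three phases and the priority rule can be put together in a short sketch. As before, this is only an illustration under assumptions: rows are integers in [0, 2^n), bit j is the jth most significant bit, and the names `bit_fixing_columns`, `three_phase_route`, and `priority` are hypothetical.

```python
import random


def bit_fixing_columns(x, y, r, n):
    """Nodes visited from (x, r) to (y, r), following the indexing of Algorithm 4.3."""
    path, row = [(x, r)], x
    for i in range(n):
        j = ((i + r) % n) + 1
        mask = 1 << (n - j)
        if (row ^ y) & mask:
            row ^= mask
        path.append((row, j % n))
    return path


def three_phase_route(x, r, y, s, n, rng):
    """Return the node sequences of Phases I, II and III for one packet."""
    w = rng.randrange(2 ** n)                      # random intermediate row
    phase1 = bit_fixing_columns(x, w, r, n)        # (x, r) -> (w, r)
    phase2 = [(w, (r + k) % n) for k in range((s - r) % n + 1)]   # direct edges to (w, s)
    phase3 = bit_fixing_columns(w, y, s, n)        # (w, s) -> (y, s)
    return phase1, phase2, phase3


def priority(phase, t, n):
    """Priority (phase - 1) * n + t; smaller numbers are sent first."""
    return (phase - 1) * n + t


if __name__ == "__main__":
    p1, p2, p3 = three_phase_route(0b101, 1, 0b010, 2, 3, random.Random(0))
    print(p1, p2, p3, sep="\n")
```

Because a packet makes at most n traversals per phase, every Phase I priority is smaller than every Phase II priority, which in turn is smaller than every Phase III priority.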


Theorem 4.14: Given an arbitrary permutation routing problem on the wrapped butterfly with $N = n2^n$ nodes, with probability $1 - O(N^{-1})$ the three-phase routing scheme of Algorithm 4.4 routes all packets to their destinations in $O(n) = O(\log N)$ parallel steps.


Proof: The priority rule in the edge queues guarantees that packets in a phase cannot 
delay packets in earlier phases. Because of this, in our forthcoming analysis we can 
consider the time for each phase to complete separately and then add these times to 
bound the total time for the three-phase routing scheme to complete. 

We begin by considering the second phase. We first argue that with high probability each row transmits at most 4n packets in the second phase. To see this, let $X_w$ be the number of packets whose intermediate row choice is w in the three-phase routing algorithm. Then $X_w$ is the sum of 0-1 independent random variables, one for each packet, and $E[X_w] = n$. Hence, we can directly apply the Chernoff bound (4.1) to find

$$\Pr(X_w \ge 4n) \le \left(\frac{e^3}{4^4}\right)^n \le 3^{-2n}.$$

There are $2^n$ possible rows w. By the union bound, the probability that any row has more than 4n packets is only $2^n \cdot 3^{-2n} = O(N^{-1})$.
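
A quick experiment illustrates how rarely a row receives 4n packets. The sketch below draws an independent uniform intermediate row for each of the $n2^n$ packets and reports the largest row load; rows are numbered from 0 and the function name `max_row_load` is hypothetical.

```python
import random
from collections import Counter


def max_row_load(n, rng):
    """Largest number of packets that choose the same intermediate row."""
    packets = n * 2 ** n
    loads = Counter(rng.randrange(2 ** n) for _ in range(packets))
    return max(loads.values())


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (4, 6, 8, 10):
        print(n, max_row_load(n, rng), "threshold:", 4 * n)
```

Since the loads sum to $n2^n$ over $2^n$ rows, the maximum is always at least n; the Chernoff bound says it rarely reaches 4n.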

We now argue that, if each row has at most 4n packets for the second phase, then the second phase takes at most 5n steps to complete. Combined with our previous observations, this means the second phase takes at most 5n steps with probability $1 - O(N^{-1})$. To see this, note that in the second phase the routing has a special structure: each packet moves from edge to edge along its row. Because of the priority rule, each packet can be delayed only by packets already in a queue when it arrives. Therefore, to place an upper bound on the number of packets that delay a packet p, we can bound the total number of packets found in each queue when p arrives at the queue. But in Phase II, the number of other packets that an arriving packet finds in a queue cannot increase in size over time, since at each step a queue sends a packet and receives at most one packet. (It is worth considering the special case when a queue becomes empty at some point in Phase II; this queue can receive another packet at some later step, but the number of packets an arriving packet will find in the queue after that point is always zero.) Since there are at most 4n packets total in the row to begin with, p finds at most 4n packets that delay it as it moves from queue to queue. Since each packet moves at most n times in the second phase, the total time for the phase is at most 5n steps.

We now consider the other phases. The first and third phases are again the same by 
symmetry, so we consider just the first phase. Our analysis will use a delay sequence 
argument. 


Definition 4.5: A delay sequence for an execution of Phase I is a sequence of n edges $e_1, \ldots, e_n$ such that either $e_i = e_{i+1}$ or $e_{i+1}$ is an outgoing edge from the end vertex of $e_i$. The sequence $e_1, \ldots, e_n$ has the further property that $e_i$ is (one of) the last edges to transmit packets with priority up to i among $e_{i+1}$ and the two incoming edges of $e_{i+1}$.

The relation between the delay sequence and the time for Phase I to complete is given 
by the following lemma. 


Lemma 4.15: For a given execution of Phase I and delay sequence $e_1, \ldots, e_n$, let $t_i$ be the number of packets with priority i sent through edge $e_i$. Let $T_i$ be the time that edge $e_i$ finishes sending all packets with priority number up to i, so that $T_n$ is the earliest time at which all packets passing through $e_n$ during Phase I have passed through it. Then:

1. $T_n \le \sum_{i=1}^{n} t_i$.

2. If the execution of Phase I takes T steps, then there is a delay sequence for this execution for which $\sum_{i=1}^{n} t_i \ge T$.

Proof: By the design of the delay sequence, at time $T_i$ the queue of $e_{i+1}$ already holds all of the packets that it will need to subsequently transmit with priority i + 1, and at that time it has already finished transmitting all packets with priority numbers up to i. Thus,

$$T_{i+1} \le T_i + t_{i+1}.$$

Since $T_1 = t_1$, we have

$$T_n \le T_{n-1} + t_n \le T_{n-2} + t_{n-1} + t_n \le \cdots \le \sum_{i=1}^{n} t_i,$$

proving the first part of the lemma.

For the second part, assume that Phase I took T steps and let e be an edge that transmitted a packet at time T. We can construct a delay sequence with $e_n = e$ by choosing $e_{n-1}$ to be the edge among e and its two incoming edges that last transmits packets of priority n - 1, and similarly choosing $e_{n-2}$ down to $e_1$. By the first part of the lemma, $\sum_{i=1}^{n} t_i \ge T$. □


Returning to the proof of Theorem 4.14, we now show that the probability of a delay sequence with $T \ge 40n$ is only $O(N^{-1})$. We call any sequence of edges $e_1, \ldots, e_n$ such that either $e_i = e_{i+1}$ or $e_{i+1}$ is an outgoing edge from the end vertex of $e_i$ a possible delay sequence. For a given execution and a possible delay sequence, let $t_i$ be the number of packets with priority i sent through $e_i$. Let $T = \sum_{i=1}^{n} t_i$. We first bound E[T]. Consider the edge $e_i = v \to v'$. Packets with priority i pass through this edge only if their source is at distance i - 1 from v. There are precisely $2^{i-1}$ nodes that are connected to v by a directed path of length i - 1. Since packets are sent in Phase I to random destinations, the probability that each of these nodes sends a packet that traverses edge $e_i$ is $2^{-i}$, giving

$$E[t_i] = 2^{i-1} 2^{-i} = \frac{1}{2} \quad \text{and} \quad E[T] = \frac{n}{2}.$$

The motivation for using the delay sequence argument should now be clear. Each 
possible delay sequence defines a random variable T, where E[T] = n/2. The maximum
of T over all delay sequences bounds the run-time of the phase. So we need
a bound on T that holds with sufficiently high probability to cover all possible delay 
sequences. A high-probability bound on T can now be obtained using an argument 
similar to the one used in the proof of Theorem 4.12. We first bound the number of 
different packets that contribute to edge traversals counted in T. 

For $j = 1, \ldots, N$, let $H_j = 1$ if any traversal of the packet sent by node j is counted in T; otherwise, $H_j = 0$. Clearly, $H = \sum_{j=1}^{N} H_j \le T$ and $E[H] \le E[T] = n/2$, where the $H_j$ are independent random variables. Applying the Chernoff bound (4.3) therefore yields

$$\Pr(H \ge 5n) \le 2^{-5n}.$$

Conditioning on the event H < 5n, we now proceed to prove a bound on T, following the same line as in the proof of Theorem 4.12. Given a packet u with at least one traversal counted in T, we consider how many additional traversals of u are counted in T. Specifically, if u is counted in $t_i$ then we consider the probability that it is counted in $t_{i+1}$. We distinguish between two cases as follows.

1. If $e_{i+1} = e_i$ then u cannot be counted in $t_{i+1}$, since its traversal with priority i + 1 is in the next column. Similarly, it cannot be counted in any $t_j$, j > i.

2. If $e_{i+1} \ne e_i$, then the probability that u continues through $e_{i+1}$ (and is counted in $t_{i+1}$) is at most 1/2. If it does not continue through $e_{i+1}$, then it cannot intersect with the delay sequence in any further traversals in this phase.


As in the proof of Theorem 4.12, the probability that $T \ge 40n$ is less than the probability that a fair coin flipped 40n times comes up heads fewer than 5n times. (Keep in mind that, in this case, the first traversal by each packet in H must be counted as contributing to T.) Letting Z be the number of heads in 40n fair coin flips, we now apply the Chernoff bound (4.5) to prove

$$\Pr(T \ge 40n \mid H < 5n) \le \Pr(Z \le 5n) \le e^{-20n(3/4)^2/2} \le 2^{-5n}.$$


We conclude that

$$\Pr(T \ge 40n) \le \Pr(T \ge 40n \mid H < 5n) + \Pr(H \ge 5n) \le 2^{-5n+1}.$$

There are no more than $2N3^{n-1} \le n2^n3^n$ possible delay sequences. Thus, the probability that, in the execution of Phase I, there is a delay sequence with $T \ge 40n$ is bounded above (using the union bound) by

$$n2^n3^n2^{-5n+1} \le O(N^{-1}).$$
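
To see concretely why this union bound is $O(N^{-1})$, one can multiply it by $N = n2^n$ and observe that the product, $2n^2(3/8)^n$, stays bounded and then shrinks. The short sketch below does this numerically; the function name is hypothetical.

```python
def delay_sequence_union_bound(n):
    """The quantity n * 2^n * 3^n * 2^(-5n+1) from the union bound."""
    return n * 2.0 ** n * 3.0 ** n * 2.0 ** (-5 * n + 1)


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32):
        N = n * 2 ** n
        bound = delay_sequence_union_bound(n)
        print(n, bound, bound * N)    # last column equals 2 * n^2 * (3/8)^n
```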

Since Phase III is entirely similar to Phase I and since Phase II also finishes in O(n) steps with probability $1 - O(N^{-1})$, we have that the three-phase routing algorithm finishes in O(n) steps with probability $1 - O(N^{-1})$. ■

4.6. Exercises 

Exercise 4.1: Alice and Bob play checkers often. Alice is a better player, so the probability that she wins any given game is 0.6, independent of all other games. They decide to play a tournament of n games. Bound the probability that Alice loses the tournament using a Chernoff bound.


Exercise 4.2: We have a standard six-sided die. Let X be the number of times that a 6 occurs over n throws of the die. Let p be the probability of the event $X \ge n/4$. Compare the best upper bounds on p that you can obtain using Markov's inequality, Chebyshev's inequality, and Chernoff bounds.

Exercise 4.3: (a) Determine the moment generating function for the binomial random 
variable B(n, p). 

(b) Let X be a B(n, p) random variable and Y a B(m, p) random variable, where X and Y are independent. Use part (a) to determine the moment generating function of X + Y.

(c) What can we conclude from the form of the moment generating function of X + Y?

Exercise 4.4: Determine the probability of obtaining 55 or more heads when flipping 
a fair coin 100 times by an explicit calculation, and compare this with the Chernoff 
bound. Do the same for 550 or more heads in 1000 flips. 


Exercise 4.5: We plan to conduct an opinion poll to find out the percentage of people in a community who want its president impeached. Assume that every person answers either yes or no. If the actual fraction of people who want the president impeached is p, we want to find an estimate X of p such that

$$\Pr(|X - p| \le \varepsilon p) > 1 - \delta$$

for a given $\varepsilon$ and $\delta$, with $0 < \varepsilon, \delta < 1$.

We query N people chosen independently and uniformly at random from the community and output the fraction of them who want the president impeached. How large should N be for our result to be a suitable estimator of p? Use Chernoff bounds, and express N in terms of p, $\varepsilon$, and $\delta$. Calculate the value of N from your bound if $\varepsilon = 0.1$ and $\delta = 0.05$ and if you know that p is between 0.2 and 0.8.

Exercise 4.6: (a) In an election with two candidates using paper ballots, each vote is independently misrecorded with probability p = 0.02. Use a Chernoff bound to bound the probability that more than 4% of the votes are misrecorded in an election of 1,000,000 ballots.

(b) Assume that a misrecorded ballot always counts as a vote for the other candidate. Suppose that candidate A received 510,000 votes and that candidate B received 490,000 votes. Use Chernoff bounds to bound the probability that candidate B wins the election owing to misrecorded ballots. Specifically, let X be the number of votes for candidate A that are misrecorded and let Y be the number of votes for candidate B that are misrecorded. Bound $\Pr((X > k) \cap (Y < \ell))$ for suitable choices of k and $\ell$.

Exercise 4.7: Throughout the chapter we implicitly assumed the following extension 
of the Chernoff bound. Prove that it is true. 

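Let $X = \sum_{i=1}^{n} X_i$, where the $X_i$ are independent 0-1 random variables. Let $\mu = E[X]$. Choose any $\mu_L$ and $\mu_H$ such that $\mu_L \le \mu \le \mu_H$. Then, for any $\delta > 0$,

$$\Pr(X \ge (1 + \delta)\mu_H) \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu_H}.$$

Similarly, for any $0 < \delta < 1$,

$$\Pr(X \le (1 - \delta)\mu_L) \le \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{\mu_L}.$$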

Exercise 4.8: We show how to construct a random permutation $\pi$ on [1, n], given a black box that outputs numbers independently and uniformly at random from [1, k] where $k \ge n$. If we compute a function $f: [1, n] \to [1, k]$ with $f(i) \ne f(j)$ for $i \ne j$, this yields a permutation: simply output the numbers [1, n] according to the order of the f(i) values. To construct such a function f, do the following for $j = 1, \ldots, n$: choose f(j) by repeatedly obtaining numbers from the black box and setting f(j) to the first number found such that $f(j) \ne f(i)$ for i < j.

Prove that this approach gives a permutation chosen uniformly at random from all 
permutations. Find the expected number of calls to the black box that are needed when 
k = n and k = 2n. For the case k = 2n, argue that the probability that each call to 
the black box assigns a value of f(j) to some j is at least 1/2. Based on this, use a 
Chernoff bound to bound the probability that the number of calls to the black box is at 
least 4n. 
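
The construction in this exercise can be sketched directly; the sketch also counts calls to the black box so that the quantities in the exercise can be explored empirically. It does not prove anything the exercise asks for. Python's `random.Random.randint` stands in for the black box, and the function name `random_permutation` is hypothetical.

```python
import random


def random_permutation(n, k, rng):
    """Build f with distinct values in [1, k], then order 1..n by f.

    Returns the permutation and the number of black-box calls used.
    """
    if k < n:
        raise ValueError("need k >= n")
    f, used, calls = {}, set(), 0
    for j in range(1, n + 1):
        while True:
            calls += 1
            value = rng.randint(1, k)      # one call to the black box
            if value not in used:          # f(j) != f(i) for all i < j
                used.add(value)
                f[j] = value
                break
    permutation = sorted(range(1, n + 1), key=f.__getitem__)
    return permutation, calls


if __name__ == "__main__":
    rng = random.Random(0)
    for k in (10, 20):
        print(k, random_permutation(10, k, rng))
```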

Exercise 4.9: Suppose that we can obtain independent samples $X_1, X_2, \ldots$ of a random variable X and that we want to use these samples to estimate E[X]. Using t samples, we use $(\sum_{i=1}^{t} X_i)/t$ for our estimate of E[X]. We want the estimate to be within $\varepsilon E[X]$ from the true value of E[X] with probability at least $1 - \delta$. We may not be able to use Chernoff's bound directly to bound how good our estimate is if X is not a 0-1 random variable, and we do not know its moment generating function. We develop an alternative approach that requires only having a bound on the variance of X. Let $r = \sqrt{Var[X]}/E[X]$.

(a) Show using Chebyshev’s inequality that $O(r^2/\varepsilon^2\delta)$ samples are sufficient to solve the problem.

(b) Suppose that we need only a weak estimate that is within $\varepsilon E[X]$ of E[X] with probability at least 3/4. Argue that $O(r^2/\varepsilon^2)$ samples are enough for this weak estimate.

(c) Show that, by taking the median of $O(\log(1/\delta))$ weak estimates, we can obtain an estimate within $\varepsilon E[X]$ of E[X] with probability at least $1 - \delta$. Conclude that we need only $O((r^2 \log(1/\delta))/\varepsilon^2)$ samples.

Exercise 4.10: A casino is testing a new class of simple slot machines. Each game, the 
player puts in $1, and the slot machine is supposed to return either $3 to the player with 
probability 4/25, $100 with probability 1/200, or nothing with all remaining probability.
Each game is supposed to be independent of other games.

The casino has been surprised to find in testing that the machines have lost $10,000 
over the first million games. Derive a Chernoff bound for the probability of this event. 
You may want to use a calculator or program to help you choose appropriate values as 
you derive your bound. 

Exercise 4.11: Consider a collection $X_1, \ldots, X_n$ of n independent integers chosen uniformly from the set {0, 1, 2}. Let $X = \sum_{i=1}^{n} X_i$ and $0 < \delta < 1$. Derive a Chernoff bound for $\Pr(X \ge (1 + \delta)n)$ and $\Pr(X \le (1 - \delta)n)$.

Exercise 4.12: Consider a collection $X_1, \ldots, X_n$ of n independent geometrically distributed random variables with mean 2. Let $X = \sum_{i=1}^{n} X_i$ and $\delta > 0$.

(a) Derive a bound on $\Pr(X \ge (1 + \delta)(2n))$ by applying the Chernoff bound to a sequence of $(1 + \delta)(2n)$ fair coin tosses.

(b) Directly derive a Chernoff bound on $\Pr(X \ge (1 + \delta)(2n))$ using the moment generating function for geometric random variables.

(c) Which bound is better? 

Exercise 4.13: Let $X_1, \ldots, X_n$ be independent Poisson trials such that $\Pr(X_i = 1) = p$. Let $X = \sum_{i=1}^{n} X_i$, so that E[X] = pn. Let

$$F(x, p) = x \ln(x/p) + (1 - x) \ln((1 - x)/(1 - p)).$$

(a) Show that, for $1 \ge x > p$,

$$\Pr(X \ge xn) \le e^{-nF(x,p)}.$$



(b) Show that, when 0 < x, p < 1, we have $F(x, p) - 2(x - p)^2 \ge 0$. (Hint: Take the second derivative of $F(x, p) - 2(x - p)^2$ with respect to x.)

(c) Using parts (a) and (b), argue that

$$\Pr(X \ge (p + \varepsilon)n) \le e^{-2n\varepsilon^2}.$$

(d) Use symmetry to argue that

$$\Pr(X \le (p - \varepsilon)n) \le e^{-2n\varepsilon^2},$$

and conclude that

$$\Pr(|X - pn| \ge \varepsilon n) \le 2e^{-2n\varepsilon^2}.$$

Exercise 4.14: Modify the proof of Theorem 4.4 to show the following bound for a weighted sum of Poisson trials. Let $X_1, \ldots, X_n$ be independent Poisson trials such that $\Pr(X_i = 1) = p_i$ and let $a_1, \ldots, a_n$ be real numbers in [0, 1]. Let $X = \sum_{i=1}^{n} a_i X_i$ and $\mu = E[X]$. Then the following Chernoff bound holds: for any $\delta > 0$,

$$\Pr(X \ge (1 + \delta)\mu) \le \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}.$$

Prove a similar bound for the probability that $X \le (1 - \delta)\mu$ for any $0 < \delta < 1$.

Exercise 4.15: Let $X_1, \ldots, X_n$ be independent random variables such that

$$\Pr(X_i = 1 - p_i) = p_i \quad \text{and} \quad \Pr(X_i = -p_i) = 1 - p_i.$$

Let $X = \sum_{i=1}^{n} X_i$. Prove

$$\Pr(|X| \ge a) \le 2e^{-2a^2/n}.$$

Hint: You may need to assume the inequality

$$p_i e^{\lambda(1 - p_i)} + (1 - p_i) e^{-\lambda p_i} \le e^{\lambda^2/8}.$$

This inequality is difficult to prove directly.

Exercise 4.16: Let $X_1, \ldots, X_n$ be independent Poisson trials such that $\Pr(X_i = 1) = p_i$. Let $X = \sum_{i=1}^{n} X_i$ and $\mu = E[X]$. Use the result of Exercise 4.15 to prove that, for any $0 < \delta < 1$,

$$\Pr(|X - \mu| \ge \delta\mu) \le 2e^{-2\delta^2\mu^2/n}.$$

Exercise 4.17: Suppose that we have n jobs to distribute among m processors. For 
simplicity, we assume that m divides n. A job takes 1 step with probability p and k > 1 
steps with probability 1 — p. Use Chernoff bounds to determine upper and lower bounds 
(that hold with high probability) on when all jobs will be completed if we randomly
assign exactly n/m jobs to each processor.




Exercise 4.18: In many wireless communication systems, each receiver listens on a specific frequency. The bit b(t) sent at time t is represented by a 1 or -1. Unfortunately, noise from other nearby communications can affect the receiver’s signal. A simplified model of this noise is as follows. There are n other senders, and the ith has strength $p_i \le 1$. At any time t, the ith sender is also trying to send a bit $b_i(t)$ that is represented by 1 or -1. The receiver obtains the signal s(t) given by

$$s(t) = b(t) + \sum_{i=1}^{n} p_i b_i(t).$$

If s(t) is closer to 1 than -1, the receiver assumes that the bit sent at time t was a 1; otherwise, the receiver assumes that it was a -1.

Assume that all the bits $b_i(t)$ can be considered independent, uniform random variables. Give a Chernoff bound to estimate the probability that the receiver makes an error in determining b(t).

Exercise 4.19: Recall that a function f is said to be convex if, for any $x_1, x_2$ and for
$0 \le \lambda \le 1$,

\[ f(\lambda x_1 + (1-\lambda)x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2). \]


(a) Let Z be a random variable that takes on a (finite) set of values in the interval [0,1],
and let $p = \mathbf{E}[Z]$. Define the Bernoulli random variable X by $\Pr(X = 1) = p$ and
$\Pr(X = 0) = 1 - p$. Show that $\mathbf{E}[f(Z)] \le \mathbf{E}[f(X)]$ for any convex function f.

(b) Use the fact that $f(x) = e^{tx}$ is convex for any $t \ge 0$ to obtain a Chernoff-like
bound for Z based on a Chernoff bound for X.


Exercise 4.20: We prove that the Randomized Quicksort algorithm sorts a set of n
numbers in time $O(n \log n)$ with high probability. Consider the following view of
Randomized Quicksort. Every point in the algorithm where it decides on a pivot element
is called a node. Suppose the size of the set to be sorted at a particular node is s.
The node is called good if the pivot element divides the set into two parts, each of size
not exceeding 2s/3. Otherwise the node is called bad. The nodes can be thought of
as forming a tree in which the root node has the whole set to be sorted and its children
have the two sets formed after the first pivot step and so on.


(a) Show that the number of good nodes in any path from the root to a leaf in this tree
is not greater than $c \log_2 n$, where c is some positive constant.

(b) Show that, with high probability (greater than $1 - 1/n^2$), the number of nodes in a
given root to leaf path of the tree is not greater than $c' \log_2 n$, where c' is another
constant.

(c) Show that, with high probability (greater than $1 - 1/n$), the number of nodes in the
longest root to leaf path is not greater than $c' \log_2 n$. (Hint: How many nodes are
there in the tree?)

(d) Use your answers to show that the running time of Quicksort is $O(n \log n)$ with
probability at least $1 - 1/n$.
Exercise 4.21: Consider the bit-fixing routing algorithm for routing a permutation on
the n-cube. Suppose that n is even. Write each source node s as the concatenation of
two binary strings $a_s$ and $b_s$, each of length n/2. Let the destination of s's packet be
the concatenation of $b_s$ and $a_s$. Show that this permutation causes the bit-fixing routing
algorithm to take $\Omega(\sqrt{N})$ steps, where $N = 2^n$ is the number of nodes.


Exercise 4.22: Consider the following modification to the bit-fixing routing algorithm 
for routing a permutation on the n-cube. Suppose that, instead of fixing the bits in order
from 1 to n, each packet chooses a random order (independent of other packets' choices)
and fixes the bits in that order. Show that there is a permutation for which this algorithm
requires $2^{\Omega(n)}$ steps with high probability.

Exercise 4.23: Assume that we use the randomized routing algorithm for the n-cube 
network (Algorithm 4.2) to route a total of up to $p2^n$ packets, where each node is the
source of no more than p packets and each node is the destination of no more than p 
packets. 

(a) Give a high-probability bound on the run-time of the algorithm. 

(b) Give a high-probability bound on the maximum number of packets at any node at 
any step of the execution of the routing algorithm. 


Exercise 4.24: Show that the expected number of packets that traverse any edge on 
the path of a given packet when routing a random permutation on the wrapped butterfly 
network of $N = n2^n$ nodes is $\Omega(n^2)$.

Exercise 4.25: In this exercise, we design a randomized algorithm for the following 
packet routing problem. We are given a network that is an undirected connected graph 
G, where nodes represent processors and the edges between the nodes represent wires. 
We are also given a set of N packets to route. For each packet we are given a source 
node, a destination node, and the exact route (path in the graph) that the packet should 
take from the source to its destination. (We may assume that there are no loops in the 
path.) In each time step, at most one packet can traverse an edge. A packet can wait at 
any node during any time step, and we assume unbounded queue sizes at each node. 

A schedule for a set of packets specifies the timing for the movement of packets 
along their respective routes. That is, it specifies which packet should move and which
should wait at each time step. Our goal is to produce a schedule for the packets that 
tries to minimize the total time and the maximum queue size needed to route all the 
packets to their destinations. 

(a) The dilation d is the maximum distance traveled by any packet. The congestion c 
is the maximum number of packets that must traverse a single edge during the en¬ 
tire course of the routing. Argue that the time required for any schedule should be 
at least $\Omega(c + d)$.

(b) Consider the following unconstrained schedule, where many packets may traverse 
an edge during a single time step. Assign each packet an integral delay chosen randomly,
independently, and uniformly from the interval $[1, \lceil \alpha c/\log(Nd) \rceil]$, where
$\alpha$ is a constant. A packet that is assigned a delay of x waits in its source node for
x time steps; then it moves on to its final destination through its specified route
without ever stopping. Give an upper bound on the probability that more than
$O(\log(Nd))$ packets use a particular edge e at a particular time step t.

(c) Again using the unconstrained schedule of part (b), show that the probability that
more than $O(\log(Nd))$ packets pass through any edge at any time step is at most
$1/(Nd)$ for a sufficiently large $\alpha$.

(d) Use the unconstrained schedule to devise a simple randomized algorithm that, with 
high probability, produces a schedule of length $O(c + d\log(Nd))$ using queues of
size $O(\log(Nd))$ and following the constraint that at most one packet crosses an
edge per time step. 
Balls, Bins, and Random Graphs


In this chapter, we focus on one of the most basic of random processes: m balls are 
thrown randomly into n bins, each ball landing in a bin chosen independently and uni¬ 
formly at random. We use the techniques we have developed previously to analyze 
this process and develop a new approach based on what is known as the Poisson ap¬ 
proximation. We demonstrate several applications of this model, including a more 
sophisticated analysis of the coupon collector’s problem and an analysis of the Bloom 
filter data structure. After introducing a closely related model of random graphs, we 
show an efficient algorithm for finding a Hamiltonian cycle on a random graph with 
sufficiently many edges. Even though finding a Hamiltonian cycle is NP-hard in gen¬ 
eral, our result shows that, for a randomly chosen graph, the problem is solvable in 
polynomial time with high probability. 

5.1. Example: The Birthday Paradox 

Sitting in lecture, you notice that there are 30 people in the room. Is it more likely that 
some two people in the room share the same birthday or that no two people in the room 
share the same birthday? 

We can model this problem by assuming that the birthday of each person is a random 
day from a 365-day year, chosen independently and uniformly at random for each per¬ 
son. This is obviously a simplification; for example, we assume that a person’s birthday 
is equally likely to be any day of the year, we avoid the issue of leap years, and we ig¬ 
nore the possibility of twins! As a model, however, it has the virtue of being easy to 
understand and analyze. 

One way to calculate this probability is to directly count the configurations where 
two people do not share a birthday. It is easier to think about the configurations where 
people do not share a birthday than about configurations where some two people do. 
Thirty days must be chosen from the 365; there are $\binom{365}{30}$ ways to do this. These 30
days can be assigned to the people in any of the 30! possible orders. Hence there are
$\binom{365}{30}\,30!$ configurations where no two people share the same birthday, out of the $365^{30}$
ways the birthdays could occur. Thus, the probability is 




\[ \frac{\binom{365}{30}\,30!}{365^{30}}. \tag{5.1} \]


We can also calculate this probability by considering one person at a time. The first
person in the room has a birthday. The probability that the second person has a different
birthday is (1 - 1/365). The probability that the third person in the room then has a
birthday different from the first two, given that the first two people have different
birthdays, is (1 - 2/365). Continuing on, the probability that the kth person in the room
has a different birthday than the first k - 1, assuming that the first k - 1 have different
birthdays, is (1 - (k - 1)/365). So the probability that 30 people all have different
birthdays is the product of these terms, or

\[ \left(1 - \frac{1}{365}\right)\left(1 - \frac{2}{365}\right)\left(1 - \frac{3}{365}\right) \cdots \left(1 - \frac{29}{365}\right). \]


You can check that this matches the expression (5.1). 

Calculations reveal that (to four decimal places) this product is 0.2937, so when 30 
people are in the room there is more than a 70% chance that two share the same birth¬ 
day. A similar calculation shows that only 23 people need to be in the room before it 
is more likely than not that two people share a birthday. 

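More generally, if there are m people and n possible birthdays then the probability
that all m have different birthdays is

\[ \left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{m-1}{n}\right) = \prod_{j=1}^{m-1}\left(1 - \frac{j}{n}\right). \]

Using that $1 - k/n \approx e^{-k/n}$ when k is small compared to n, we see that if m is small
compared to n then

\[ \prod_{j=1}^{m-1}\left(1 - \frac{j}{n}\right) \approx \prod_{j=1}^{m-1} e^{-j/n} = \exp\left\{-\sum_{j=1}^{m-1}\frac{j}{n}\right\} = e^{-m(m-1)/2n} \approx e^{-m^2/2n}. \]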


Hence the value for m at which the probability that m people all have different birthdays
is 1/2 is approximately given by the equation

\[ \frac{m^2}{2n} = \ln 2, \]

or $m = \sqrt{2n \ln 2}$. For the case n = 365, this approximation gives m = 22.49 to two
decimal places, matching the exact calculation quite well.
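The numbers quoted above can be reproduced with a few lines of Python. This is only a sketch under the same simplifying model (n equally likely birthdays, chosen independently); the function names are illustrative and are not part of the text.

```python
import math


def prob_all_distinct(m: int, n: int = 365) -> float:
    """Exact probability that m people have m distinct birthdays among n days."""
    p = 1.0
    for j in range(1, m):
        p *= 1.0 - j / n  # the (j+1)st person avoids the j birthdays already taken
    return p


def prob_all_distinct_approx(m: int, n: int = 365) -> float:
    """The approximation e^{-m^2/(2n)}, reasonable when m is small compared to n."""
    return math.exp(-m * m / (2.0 * n))


def threshold_estimate(n: int = 365) -> float:
    """Approximate m at which the all-distinct probability drops to 1/2."""
    return math.sqrt(2.0 * n * math.log(2.0))


if __name__ == "__main__":
    for m in (22, 23, 30):
        print(m, prob_all_distinct(m), prob_all_distinct_approx(m))
    print(threshold_estimate())
```

Comparing the two functions for a few values of m gives a feel for how quickly the approximation degrades as m grows relative to n.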

Quite tight and formal bounds can be established using bounds in place of the ap¬ 
proximations just derived, an option that is considered in Exercise 5.3. The following 
simple arguments, however, give loose bounds and good intuition. Let us consider each 
person one at a time, and let $E_k$ be the event that the kth person's birthday does not
match any of the birthdays of the first k - 1 people. Then the probability that the first
k people fail to have distinct birthdays is

\[ \Pr(\bar{E}_1 \cup \bar{E}_2 \cup \cdots \cup \bar{E}_k) \le \sum_{i=1}^{k} \Pr(\bar{E}_i) \le \sum_{i=1}^{k} \frac{i-1}{n} = \frac{k(k-1)}{2n}. \]

If $k \le \sqrt{n}$ this probability is less than 1/2, so with $\lfloor \sqrt{n} \rfloor$ people the probability is at
least 1/2 that all birthdays will be distinct.

Now assume that the first $\lceil \sqrt{n} \rceil$ people all have distinct birthdays. Each person after
that has probability at least $\sqrt{n}/n = 1/\sqrt{n}$ of having the same birthday as one of these
first $\lceil \sqrt{n} \rceil$ people. Hence the probability that the next $\lceil \sqrt{n} \rceil$ people all have different
birthdays than the first $\lceil \sqrt{n} \rceil$ people is at most

\[ \left(1 - \frac{1}{\sqrt{n}}\right)^{\lceil \sqrt{n} \rceil} < \frac{1}{e} < \frac{1}{2}. \]

Hence, once there are $2\lceil \sqrt{n} \rceil$ people, the probability is at most 1/e that all birthdays
will be distinct.


5.2. Balls into Bins 

5.2.1. The Balls-and-Bins Model 

The birthday paradox is an example of a more general mathematical framework that 
is often formulated in terms of balls and bins. We have m balls that are thrown into 
n bins, with the location of each ball chosen independently and uniformly at random 
from the n possibilities. What does the distribution of the balls in the bins look like? 
The question behind the birthday paradox is whether or not there is a bin with two 
balls. 

There are several interesting questions that we could ask about this random process. 
For example, how many of the bins are empty? How many balls are in the fullest bin? 
Many of these questions have applications to the design and analysis of algorithms. 

Our analysis of the birthday paradox showed that, if m balls are randomly placed
into n bins then, for some $m = \Omega(\sqrt{n})$, at least one of the bins is likely to have more
than one ball in it. Another interesting question concerns the maximum number of
balls in a bin, or the maximum load. Let us consider the case where m = n, so that
the number of balls equals the number of bins and the average load is 1. Of course the
maximum possible load is n, but it is very unlikely that all n balls land in the same
bin. We seek an upper bound that holds with probability tending to 1 as n grows large.




We can show that the maximum load is more than $3 \ln n/\ln\ln n$ with probability at
most 1/n for sufficiently large n via a direct calculation and a union bound. This is a
very loose bound; although the maximum load is in fact $\Omega(\ln n/\ln\ln n)$ with probability
close to 1 (as we show later), the constant factor 3 we use here is chosen to simplify
the argument and could be reduced with more care.

Lemma 5.1: When n balls are thrown independently and uniformly at random into n
bins, the probability that the maximum load is more than $3 \ln n/\ln\ln n$ is at most 1/n
for n sufficiently large.


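Proof: The probability that bin 1 receives at least M balls is at most

\[ \binom{n}{M}\left(\frac{1}{n}\right)^{M}. \]

This follows from a union bound; there are $\binom{n}{M}$ distinct sets of M balls, and for any set
of M balls the probability that all land in bin 1 is $(1/n)^M$. We now use the inequalities

\[ \binom{n}{M}\left(\frac{1}{n}\right)^{M} \le \frac{1}{M!} \le \left(\frac{e}{M}\right)^{M}. \]

Here the second inequality is a consequence of the following general bound on factorials:
since

\[ \frac{k^k}{k!} < \sum_{i=0}^{\infty} \frac{k^i}{i!} = e^k, \]

we have

\[ k! > \left(\frac{k}{e}\right)^{k}. \]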


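Applying a union bound again allows us to find that, for $M \ge 3 \ln n/\ln\ln n$, the probability
that any bin receives at least M balls is bounded above by

\[
\begin{aligned}
n\left(\frac{e}{M}\right)^{M} &\le n\left(\frac{e \ln\ln n}{3 \ln n}\right)^{3 \ln n/\ln\ln n} \\
&\le n\left(\frac{\ln\ln n}{\ln n}\right)^{3 \ln n/\ln\ln n} \\
&= e^{\ln n}\left(e^{\ln\ln\ln n - \ln\ln n}\right)^{3 \ln n/\ln\ln n} \\
&= e^{-2\ln n + 3(\ln n)(\ln\ln\ln n)/\ln\ln n} \\
&\le \frac{1}{n}
\end{aligned}
\]

for n sufficiently large. ■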
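A small simulation can make the lemma concrete. The sketch below throws n balls into n bins with Python's pseudorandom generator and compares the observed maximum load with the threshold $3 \ln n/\ln\ln n$; the helper names are illustrative, and the comparison only makes sense for $n \ge 3$, where $\ln\ln n > 0$.

```python
import math
import random
from collections import Counter


def max_load(n: int, rng: random.Random) -> int:
    """Throw n balls into n bins uniformly at random; return the fullest bin's load."""
    counts = Counter(rng.randrange(n) for _ in range(n))
    return max(counts.values())


def lemma_5_1_threshold(n: int) -> float:
    """The value 3 ln n / ln ln n from Lemma 5.1 (requires n >= 3)."""
    return 3.0 * math.log(n) / math.log(math.log(n))


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed, chosen arbitrarily, for repeatability
    for n in (100, 10_000, 1_000_000):
        print(n, max_load(n, rng), lemma_5_1_threshold(n))
```

In practice the observed maximum sits well below the threshold, which is consistent with the remark that the constant 3 is loose.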


5.2.2. Application: Bucket Sort 

Bucket sort is an example of a sorting algorithm that, under certain assumptions on the 
input, breaks the $\Omega(n \log n)$ lower bound for standard comparison-based sorting. For





binary digits are 0000000011. When $j < \ell$, the elements of the jth bucket all come
before the elements in the $\ell$th bucket in the sorted order. Assuming that each element
can be placed in the appropriate bucket in O(1) time, this stage requires only O(n)
time. Because of the assumption that the elements to be sorted are chosen uniformly,
the number of elements that land in a specific bucket follows a binomial distribution
B(n, 1/n). Buckets can be implemented using linked lists.

In the second stage, each bucket is sorted using any standard quadratic time algo¬ 
rithm (such as Bubblesort or Insertion sort). Concatenating the sorted lists from each 
bucket in order gives us the sorted order for the elements. It remains to show that the 
expected time spent in the second stage is only O(n).

The result relies on our assumption regarding the input distribution. Under the uniform
distribution, Bucket sort falls naturally into the balls and bins model: the elements
are balls, buckets are bins, and each ball falls uniformly at random into a bin.

Let $X_j$ be the number of elements that land in the jth bucket. The time to sort the
jth bucket is then at most $c(X_j)^2$ for some constant c. The expected time spent sorting
in the second stage is at most

\[ \mathbf{E}\left[\sum_{j=1}^{n} c(X_j)^2\right] = c\sum_{j=1}^{n} \mathbf{E}[X_j^2] = cn\,\mathbf{E}[X_1^2], \]

where the first equality follows from the linearity of expectations and the second follows
from symmetry, as $\mathbf{E}[X_j^2]$ is the same for all buckets.

Since $X_1$ is a binomial random variable B(n, 1/n), using the results of Section 3.2.1
yields

\[ \mathbf{E}[X_1^2] = \frac{n(n-1)}{n^2} + 1 = 2 - \frac{1}{n} < 2. \]

Hence the total expected time spent in the second stage is at most 2cn, so Bucket sort 
runs in expected linear time. 
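The two stages can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: it assumes the inputs are integers in the range $[0, 2^k)$, uses one bucket per element, stores buckets in Python lists instead of linked lists, and uses insertion sort as the quadratic-time second stage. The names `bucket_sort` and `insertion_sort` are illustrative.

```python
import random


def insertion_sort(a: list) -> list:
    """Standard quadratic-time insertion sort; sorts the list in place."""
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a


def bucket_sort(values: list, k: int) -> list:
    """Sort integers from [0, 2^k) using n = len(values) buckets."""
    n = len(values)
    if n == 0:
        return []
    buckets = [[] for _ in range(n)]
    # Stage 1: O(1) work per element. The index (v * n) >> k is monotone in v
    # and always lies in [0, n), so earlier buckets hold smaller elements.
    for v in values:
        if not 0 <= v < (1 << k):
            raise ValueError("value outside the assumed range [0, 2^k)")
        buckets[(v * n) >> k].append(v)
    # Stage 2: sort each bucket, then concatenate the buckets in order.
    out = []
    for b in buckets:
        out.extend(insertion_sort(b))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.randrange(1 << 20) for _ in range(16)]
    print(bucket_sort(data, 20))
```

The expected linear running time depends on the uniformity assumption; on adversarial inputs that crowd into one bucket, the second stage degrades to quadratic time.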


5.3. The Poisson Distribution 

We now consider the probability that a given bin is empty in the balls and bins model 
with m balls and n bins as well as the expected number of empty bins. For the first bin 
to be empty, it must be missed by all m balls. Since each ball hits the first bin with
probability 1/n, the probability the first bin remains empty is

\[ \left(1 - \frac{1}{n}\right)^{m} \approx e^{-m/n}; \]

of course, by symmetry this probability is the same for all bins. If $X_j$ is a random variable
that is 1 when the jth bin is empty and 0 otherwise, then $\mathbf{E}[X_j] = (1 - 1/n)^m$.
Let X be a random variable that represents the number of empty bins. Then, by the
linearity of expectations,

\[ \mathbf{E}[X] = \mathbf{E}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathbf{E}[X_i] = n\left(1 - \frac{1}{n}\right)^{m} \approx n e^{-m/n}. \]

Thus, the expected fraction of empty bins is approximately $e^{-m/n}$. This approximation
is very good even for moderately sized values of m and n, and we use it frequently
throughout the chapter.

We can generalize the preceding argument to find the expected fraction of bins with
r balls for any constant r. The probability that a given bin has r balls is

\[ \binom{m}{r}\left(\frac{1}{n}\right)^{r}\left(1 - \frac{1}{n}\right)^{m-r} = \frac{1}{r!} \cdot \frac{m(m-1)\cdots(m-r+1)}{n^r}\left(1 - \frac{1}{n}\right)^{m-r}. \]

When m and n are large compared to r, the second factor on the right-hand side is
approximately $(m/n)^r$, and the third factor is approximately $e^{-m/n}$. Hence the probability
$p_r$ that a given bin has r balls is approximately

\[ p_r \approx \frac{e^{-m/n}(m/n)^r}{r!}, \tag{5.2} \]

and the expected number of bins with exactly r balls is approximately $np_r$. We formalize
this relationship in Section 5.3.1.
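To see how close approximation (5.2) is, one can compare it numerically with the exact binomial probability. The following Python sketch does this; the function names are illustrative, and `math.comb` assumes Python 3.8 or later.

```python
import math


def exact_prob_r_balls(m: int, n: int, r: int) -> float:
    """Exact probability that a given bin receives r of the m balls."""
    return math.comb(m, r) * (1.0 / n) ** r * (1.0 - 1.0 / n) ** (m - r)


def poisson_prob_r_balls(m: int, n: int, r: int) -> float:
    """The approximation of Eqn. (5.2): e^{-m/n} (m/n)^r / r!."""
    mu = m / n
    return math.exp(-mu) * mu ** r / math.factorial(r)


if __name__ == "__main__":
    m = n = 1000
    for r in range(5):
        print(r, exact_prob_r_balls(m, n, r), poisson_prob_r_balls(m, n, r))
```

Setting r = 0 recovers the comparison between $(1 - 1/n)^m$ and $e^{-m/n}$ for the fraction of empty bins.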

The previous calculation naturally leads us to consider the following distribution. 


Definition 5.1: A discrete Poisson random variable X with parameter $\mu$ is given by the
following probability distribution on $j = 0, 1, 2, \ldots$:

\[ \Pr(X = j) = \frac{e^{-\mu}\mu^j}{j!}. \]

(Note that Poisson random variables differ from Poisson trials, discussed in Section
4.2.1.)

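Let us verify that the definition gives a proper distribution in that the probabilities
sum to 1:

\[ \sum_{j=0}^{\infty} \Pr(X = j) = \sum_{j=0}^{\infty} \frac{e^{-\mu}\mu^j}{j!} = e^{-\mu}\sum_{j=0}^{\infty} \frac{\mu^j}{j!} = 1, \]

where we have used the Taylor expansion $e^x = \sum_{j=0}^{\infty} (x^j/j!)$.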
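Next we show that the expectation of this random variable is $\mu$:

\[
\begin{aligned}
\mathbf{E}[X] &= \sum_{j=0}^{\infty} j \Pr(X = j) \\
&= \sum_{j=1}^{\infty} j\,\frac{e^{-\mu}\mu^j}{j!} \\
&= \mu \sum_{j=1}^{\infty} \frac{e^{-\mu}\mu^{j-1}}{(j-1)!} \\
&= \mu \sum_{j=0}^{\infty} \frac{e^{-\mu}\mu^{j}}{j!} \\
&= \mu.
\end{aligned}
\]
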

In the context of throwing m balls into n bins, the distribution of the number of balls 
in a bin is approximately Poisson with $\mu = m/n$, which is exactly the average number
of balls per bin, as one might expect. 

An important property of Poisson distributions is given in the following lemma. 

Lemma 5.2: The sum of a finite number of independent Poisson random variables is 
a Poisson random variable. 


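Proof: We consider two independent Poisson random variables X and Y with means
$\mu_1$ and $\mu_2$; the case of more random variables is simply handled by induction. Now

\[
\begin{aligned}
\Pr(X + Y = j) &= \sum_{k=0}^{j} \Pr((X = k) \cap (Y = j - k)) \\
&= \sum_{k=0}^{j} \frac{e^{-\mu_1}\mu_1^{k}}{k!} \cdot \frac{e^{-\mu_2}\mu_2^{\,j-k}}{(j-k)!} \\
&= \frac{e^{-(\mu_1+\mu_2)}}{j!} \sum_{k=0}^{j} \frac{j!}{k!\,(j-k)!}\,\mu_1^{k}\mu_2^{\,j-k} \\
&= \frac{e^{-(\mu_1+\mu_2)}}{j!} \sum_{k=0}^{j} \binom{j}{k}\mu_1^{k}\mu_2^{\,j-k} \\
&= \frac{e^{-(\mu_1+\mu_2)}(\mu_1+\mu_2)^{j}}{j!}.
\end{aligned}
\]

In the last equality we used the binomial theorem to simplify the summation. ■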
We can also prove Lemma 5.2 using moment generating functions. 
Lemma 5.3: The moment generating function of a Poisson random variable with parameter
$\mu$ is

\[ M_X(t) = e^{\mu(e^t - 1)}. \]


Proof: For any t,

\[ \mathbf{E}[e^{tX}] = \sum_{k=0}^{\infty} \frac{e^{-\mu}\mu^k}{k!}\,e^{tk} = e^{\mu(e^t - 1)} \sum_{k=0}^{\infty} \frac{e^{-\mu e^t}(\mu e^t)^k}{k!} = e^{\mu(e^t - 1)}. \]

■

Given two independent Poisson random variables X and Y with means $\mu_1$ and $\mu_2$, we
apply Theorem 4.3 to prove

\[ M_{X+Y}(t) = M_X(t) \cdot M_Y(t) = e^{(\mu_1 + \mu_2)(e^t - 1)}, \]

which is the moment generating function of a Poisson random variable with mean
$\mu_1 + \mu_2$. By Theorem 4.2, the moment generating function uniquely defines the distribution,
and hence the sum X + Y is a Poisson random variable with mean $\mu_1 + \mu_2$.
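Lemma 5.2 is easy to spot-check numerically: convolving two Poisson probability mass functions should reproduce the mass function of a Poisson variable whose mean is the sum of the two means. The Python sketch below does exactly that, using illustrative names and arbitrarily chosen means.

```python
import math


def poisson_pmf(mu: float, j: int) -> float:
    """Pr(X = j) for a Poisson random variable with parameter mu."""
    return math.exp(-mu) * mu ** j / math.factorial(j)


def pmf_of_sum(mu1: float, mu2: float, j: int) -> float:
    """Pr(X + Y = j) for independent Poisson X, Y, computed by direct convolution."""
    return sum(poisson_pmf(mu1, k) * poisson_pmf(mu2, j - k) for k in range(j + 1))


if __name__ == "__main__":
    mu1, mu2 = 1.5, 2.5  # arbitrary example means
    for j in range(6):
        print(j, pmf_of_sum(mu1, mu2, j), poisson_pmf(mu1 + mu2, j))
```

The two printed columns agree up to floating-point rounding, mirroring the binomial-theorem step in the proof.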

Next we develop a Chernoff bound for Poisson random variables that we will use 
later in this chapter. 


Theorem 5.4: Let X be a Poisson random variable with parameter $\mu$.

1. If $x > \mu$, then

\[ \Pr(X \ge x) \le \frac{e^{-\mu}(e\mu)^x}{x^x}. \]

2. If $x < \mu$, then

\[ \Pr(X \le x) \le \frac{e^{-\mu}(e\mu)^x}{x^x}. \]


Proof: For any $t > 0$ and $x > \mu$,

\[ \Pr(X \ge x) = \Pr(e^{tX} \ge e^{tx}) \le \frac{\mathbf{E}[e^{tX}]}{e^{tx}}. \]

Plugging in the expression for the moment generating function of the Poisson distribution,
we have

\[ \Pr(X \ge x) \le e^{\mu(e^t - 1) - xt}. \]

Choosing $t = \ln(x/\mu) > 0$ gives

\[ \Pr(X \ge x) \le e^{x - \mu - x\ln(x/\mu)} = \frac{e^{-\mu}(e\mu)^x}{x^x}. \]



For any $t < 0$ and $x < \mu$,

\[ \Pr(X \le x) = \Pr(e^{tX} \ge e^{tx}) \le \frac{\mathbf{E}[e^{tX}]}{e^{tx}}. \]

Hence

\[ \Pr(X \le x) \le e^{\mu(e^t - 1) - xt}. \]

Choosing $t = \ln(x/\mu) < 0$, it follows that

\[ \Pr(X \le x) \le e^{x - \mu - x\ln(x/\mu)} = \frac{e^{-\mu}(e\mu)^x}{x^x}. \]

■

5.3.1. Limit of the Binomial Distribution 

We have shown that, when throwing m balls randomly into b bins, the probability $p_r$
that a bin has r balls is approximately the Poisson distribution with mean m/b. In general,
the Poisson distribution is the limit distribution of the binomial distribution with
parameters n and p, when n is large and p is small. More precisely, we have the following
limit result.

Theorem 5.5: Let $X_n$ be a binomial random variable with parameters n and p, where
p is a function of n and $\lim_{n\to\infty} np = \lambda$ is a constant that is independent of n. Then,
for any fixed k,

\[ \lim_{n\to\infty} \Pr(X_n = k) = \frac{e^{-\lambda}\lambda^k}{k!}. \]

This theorem directly applies to the balls-and-bins scenario. Consider the situation
where there are m balls and b bins, where m is a function of b and $\lim_{b\to\infty} m/b = \lambda$.
Let $X_n$ be the number of balls in a specific bin. Then $X_n$ is a binomial random variable
with parameters m and 1/b. Theorem 5.5 thus applies and says that

\[ \lim_{n\to\infty} \Pr(X_n = r) = \frac{e^{-m/n}(m/n)^r}{r!}, \]

matching the approximation of Eqn. (5.2).
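The convergence in Theorem 5.5 can be observed directly. The Python sketch below fixes $\lambda$, sets $p = \lambda/n$, and reports the largest gap between the binomial and Poisson probabilities over small k as n grows. The names and the cutoff `kmax` are illustrative assumptions.

```python
import math


def binomial_pmf(n: int, p: float, k: int) -> float:
    """Pr(X_n = k) for a binomial random variable with parameters n and p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(lam: float, k: int) -> float:
    """The limiting value e^{-lam} lam^k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def max_gap(n: int, lam: float, kmax: int = 10) -> float:
    """Largest |binomial - Poisson| over k = 0..kmax when p = lam / n."""
    p = lam / n
    return max(abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k)) for k in range(kmax + 1))


if __name__ == "__main__":
    for n in (10, 100, 1000, 10_000):
        print(n, max_gap(n, 2.0))
```

The gap shrinks roughly in proportion to 1/n, which is consistent with the error terms that appear in the proof.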

Before proving Theorem 5.5, we describe some of its applications. Distributions of 
this type arise frequently and are often modeled by Poisson distributions. For exam¬ 
ple, consider the number of spelling or grammatical mistakes in a book, including this 
book. One model for such mistakes is that each word is likely to have an error with 
some very small probability p. The number of errors is then a binomial random vari¬ 
able with large n and small p that can therefore be treated as a Poisson random variable. 
As another example, consider the number of chocolate chips inside a chocolate chip 
cookie. One possible model is to split the volume of the cookie into a large number of 
small disjoint compartments, so that a chip lands in each compartment with some prob¬ 
ability p. With this model, the number of chips in a cookie roughly follows a Poisson 
distribution. We will see similar applications of the Poisson distribution in continuous 
settings in Chapter 8. 


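Proof of Theorem 5.5: We can write

\[ \Pr(X_n = k) = \binom{n}{k} p^k (1 - p)^{n-k}. \]

In what follows, we make use of the bound that, for $|x| \le 1$,

\[ e^x(1 - x^2) \le 1 + x \le e^x, \tag{5.3} \]

which follows from the Taylor series expansion of $e^x$. (This is left as Exercise 5.6.)
Then

\[
\begin{aligned}
\Pr(X_n = k) &\le \frac{n^k}{k!}\,p^k\,\frac{(1 - p)^n}{(1 - p)^k} \\
&\le \frac{(np)^k}{k!} \cdot \frac{e^{-pn}}{1 - pk} \\
&= \frac{e^{-pn}(np)^k}{k!} \cdot \frac{1}{1 - pk}.
\end{aligned}
\]

The second line follows from the first by Eqn. (5.3) and the fact that $(1 - p)^k \ge 1 - pk$
for $k \ge 0$. Also,

\[
\begin{aligned}
\Pr(X_n = k) &\ge \frac{(n - k + 1)^k}{k!}\,p^k (1 - p)^n \\
&\ge \frac{((n - k + 1)p)^k}{k!}\,e^{-pn}(1 - p^2)^n \\
&\ge \frac{e^{-pn}((n - k + 1)p)^k}{k!}\,(1 - p^2 n),
\end{aligned}
\]

where in the second inequality we applied Eqn. (5.3) with x = -p.

Combining, we have

\[ \frac{e^{-pn}((n - k + 1)p)^k}{k!}\,(1 - p^2 n) \le \Pr(X_n = k) \le \frac{e^{-pn}(np)^k}{k!} \cdot \frac{1}{1 - pk}. \]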


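In the limit, as n approaches infinity, p approaches zero because the limiting value of
pn is the constant $\lambda$. Hence $1/(1 - pk)$ approaches 1, $1 - p^2 n$ approaches 1, and the
difference between $(n - k + 1)p$ and np approaches 0. It follows that

\[ \lim_{n\to\infty} \frac{e^{-pn}(np)^k}{k!} \cdot \frac{1}{1 - pk} = \frac{e^{-\lambda}\lambda^k}{k!} \]

and

\[ \lim_{n\to\infty} \frac{e^{-pn}((n - k + 1)p)^k}{k!}\,(1 - p^2 n) = \frac{e^{-\lambda}\lambda^k}{k!}. \]

Since $\lim_{n\to\infty} \Pr(X_n = k)$ lies between these two values, the theorem follows. ■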


5.4. The Poisson Approximation 

The main difficulty in analyzing balls-and-bins problems is handling the dependencies 
that naturally arise in such systems. For example, if we throw m balls into n bins and 
find that bin 1 is empty, then it is less likely that bin 2 is empty because we know that 
the m balls must now be distributed among n — 1 bins. More concretely: if we know 
the number of balls in the first n - 1 bins, then the number of balls in the last bin is
completely determined. The loads of the various bins are not independent, and inde¬ 
pendent random variables are generally much easier to analyze, since we can apply 
Chernoff bounds. It is therefore useful to have a general way to circumvent these sorts 
of dependencies. 

We have already shown that, after throwing m balls independently and uniformly at
random into n bins, the distribution of the number of balls in a given bin is approximately
Poisson with mean m/n. We would like to say that the joint distribution of the
number of balls in all the bins is well approximated by assuming the load at each bin is
an independent Poisson random variable with mean m/n. This would allow us to treat
bin loads as independent random variables. We show here that we can do this when
we are concerned with sufficiently rare events. Specifically, we show in Corollary 5.9
that taking the probability of an event using this Poisson approximation for all of the
bins and multiplying it by $e\sqrt{m}$ gives an upper bound for the probability of the event
when m balls are thrown into n bins. For rare events, this extra $e\sqrt{m}$ factor will not be
significant. To achieve this result, we now introduce some technical machinery.

Suppose that m balls are thrown into n bins independently and uniformly at random,
and let $X_i^{(m)}$ be the number of balls in the ith bin, where $1 \le i \le n$. Let $Y_1^{(m)}, \ldots, Y_n^{(m)}$
be independent Poisson random variables with mean m/n. We derive a useful relationship
between these two sets of random variables. Tighter bounds for specific problems
can often be obtained with more detailed analysis, but this approach is quite general
and easy to apply.

The difference between throwing m balls randomly and assigning each bin a num¬ 
ber of balls that is Poisson distributed with mean m/n is that, in the first case, we 
know there are m balls in total, whereas in the second case we know only that m is the 
expected number of balls in all of the bins. But suppose when we use the Poisson dis¬ 
tribution we end up with m balls. In this case, we do indeed have that the distribution 
is the same as if we threw m balls into n bins randomly. 

Theorem 5.6: The distribution of $(Y_1^{(m)}, \ldots, Y_n^{(m)})$ conditioned on $\sum_i Y_i^{(m)} = k$ is the
same as $(X_1^{(k)}, \ldots, X_n^{(k)})$, regardless of the value of m.

Proof: When throwing k balls into n bins, the probability that $(X_1^{(k)}, \ldots, X_n^{(k)}) =
(k_1, \ldots, k_n)$ for any $k_1, \ldots, k_n$ satisfying $\sum_i k_i = k$ is given by

\[ \frac{\binom{k}{k_1;\,k_2;\,\ldots;\,k_n}}{n^k} = \frac{k!}{(k_1!)(k_2!)\cdots(k_n!)\,n^k}. \]

Now, for any $k_1, \ldots, k_n$ with $\sum_i k_i = k$, consider the probability that
$(Y_1^{(m)}, \ldots, Y_n^{(m)}) = (k_1, \ldots, k_n)$ conditioned on $(Y_1^{(m)}, \ldots, Y_n^{(m)})$ satisfying $\sum_i Y_i^{(m)} = k$:

\[
\Pr\left( (Y_1^{(m)}, \ldots, Y_n^{(m)}) = (k_1, \ldots, k_n) \,\Big|\, \sum_{i=1}^{n} Y_i^{(m)} = k \right)
= \frac{\Pr\left( (Y_1^{(m)} = k_1) \cap (Y_2^{(m)} = k_2) \cap \cdots \cap (Y_n^{(m)} = k_n) \right)}{\Pr\left( \sum_{i=1}^{n} Y_i^{(m)} = k \right)}.
\]

The probability that $Y_i^{(m)} = k_i$ is $e^{-m/n}(m/n)^{k_i}/k_i!$, since the $Y_i^{(m)}$ are independent
Poisson random variables with mean m/n. Also, by Lemma 5.2, the sum of the $Y_i^{(m)}$ is
itself a Poisson random variable with mean m. Hence

\[
\frac{\Pr\left( (Y_1^{(m)} = k_1) \cap (Y_2^{(m)} = k_2) \cap \cdots \cap (Y_n^{(m)} = k_n) \right)}{\Pr\left( \sum_{i=1}^{n} Y_i^{(m)} = k \right)}
= \frac{\prod_{i=1}^{n} e^{-m/n}(m/n)^{k_i}/k_i!}{e^{-m}m^k/k!}
= \frac{k!}{(k_1!)(k_2!)\cdots(k_n!)\,n^k},
\]

proving the theorem. ■
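Theorem 5.6 can be checked numerically for a small configuration. The Python sketch below evaluates both sides of the identity for a fixed load vector and several values of m; the function names and the sample vector are illustrative, and `math.prod` assumes Python 3.8 or later.

```python
import math


def multinomial_prob(ks: tuple) -> float:
    """Pr((X_1,...,X_n) = ks) when k = sum(ks) balls are thrown into n = len(ks) bins."""
    n, k = len(ks), sum(ks)
    denom = math.prod(math.factorial(ki) for ki in ks)
    return math.factorial(k) / (denom * n ** k)


def conditioned_poisson_prob(ks: tuple, m: float) -> float:
    """Pr((Y_1,...,Y_n) = ks | sum of Y_i = k) for independent Poisson(m/n) loads."""
    n, k = len(ks), sum(ks)
    mu = m / n
    joint = math.prod(math.exp(-mu) * mu ** ki / math.factorial(ki) for ki in ks)
    total = math.exp(-m) * m ** k / math.factorial(k)  # the sum is Poisson with mean m
    return joint / total


if __name__ == "__main__":
    ks = (2, 0, 1, 3)  # an arbitrary load vector with k = 6 balls in n = 4 bins
    for m in (1, 6, 20):
        print(m, conditioned_poisson_prob(ks, m), multinomial_prob(ks))
```

The conditional probability does not change with m, which is the "regardless of the value of m" clause of the theorem.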

With this relationship between the two distributions, we can prove strong results about 
any function on the loads of the bins. 

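Theorem 5.7: Let $f(x_1, \ldots, x_n)$ be a nonnegative function. Then

\[ \mathbf{E}\left[f(X_1^{(m)}, \ldots, X_n^{(m)})\right] \le e\sqrt{m}\,\mathbf{E}\left[f(Y_1^{(m)}, \ldots, Y_n^{(m)})\right]. \tag{5.4} \]


Proof: We have that

\[
\begin{aligned}
\mathbf{E}\left[f(Y_1^{(m)}, \ldots, Y_n^{(m)})\right]
&= \sum_{k=0}^{\infty} \mathbf{E}\left[f(Y_1^{(m)}, \ldots, Y_n^{(m)}) \,\Big|\, \sum_{i=1}^{n} Y_i^{(m)} = k\right] \Pr\left(\sum_{i=1}^{n} Y_i^{(m)} = k\right) \\
&\ge \mathbf{E}\left[f(Y_1^{(m)}, \ldots, Y_n^{(m)}) \,\Big|\, \sum_{i=1}^{n} Y_i^{(m)} = m\right] \Pr\left(\sum_{i=1}^{n} Y_i^{(m)} = m\right) \\
&= \mathbf{E}\left[f(X_1^{(m)}, \ldots, X_n^{(m)})\right] \Pr\left(\sum_{i=1}^{n} Y_i^{(m)} = m\right),
\end{aligned}
\]

where the last equality follows from the fact that the joint distribution of the $Y_i^{(m)}$ given
$\sum_{i=1}^{n} Y_i^{(m)} = m$ is exactly that of the $X_i^{(m)}$, as shown in Theorem 5.6. Since $\sum_{i=1}^{n} Y_i^{(m)}$
is Poisson distributed with mean m, we now have

\[ \mathbf{E}\left[f(Y_1^{(m)}, \ldots, Y_n^{(m)})\right] \ge \mathbf{E}\left[f(X_1^{(m)}, \ldots, X_n^{(m)})\right] \frac{m^m e^{-m}}{m!}. \]

We use the following loose bound on m!, which we prove as Lemma 5.8:

\[ m! \le e\sqrt{m}\left(\frac{m}{e}\right)^{m}. \]

This yields

\[ \mathbf{E}\left[f(Y_1^{(m)}, \ldots, Y_n^{(m)})\right] \ge \mathbf{E}\left[f(X_1^{(m)}, \ldots, X_n^{(m)})\right] \frac{1}{e\sqrt{m}}, \]

and the theorem is proven. ■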


We prove the upper bound we used for factorials, which closely matches the loose lower 
bound we used in Lemma 5.1. 


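Lemma 5.8:

\[ n! \le e\sqrt{n}\left(\frac{n}{e}\right)^{n}. \tag{5.5} \]


Proof: We use the fact that

\[ \ln(n!) = \sum_{i=1}^{n} \ln i. \]

We first claim that, for $i \ge 2$,

\[ \int_{i-1}^{i} \ln x \, dx \ge \frac{\ln(i - 1) + \ln i}{2}. \]

This follows from the fact that $\ln x$ is concave, since its second derivative is $-1/x^2$,
which is always negative. Therefore,

\[ \int_{1}^{n} \ln x \, dx \ge \sum_{i=1}^{n} \ln i - \frac{\ln n}{2} \]

or, equivalently,

\[ n \ln n - n + 1 \ge \ln(n!) - \frac{\ln n}{2}. \]

The result now follows simply by exponentiating. ■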
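The bound of Lemma 5.8, together with the lower bound $k! > (k/e)^k$ used in Lemma 5.1, can be checked numerically. The sketch below works with logarithms to avoid overflow for large n; the function names are illustrative, and `math.lgamma(n + 1)` is used as a floating-point value of $\ln(n!)$.

```python
import math


def log_factorial(n: int) -> float:
    """ln(n!), computed via the log-gamma function."""
    return math.lgamma(n + 1)


def log_upper_bound(n: int) -> float:
    """ln of e * sqrt(n) * (n/e)^n, the right-hand side of (5.5)."""
    return 1.0 + 0.5 * math.log(n) + n * math.log(n) - n


def log_lower_bound(n: int) -> float:
    """ln of (n/e)^n, the lower bound on n! used in Lemma 5.1."""
    return n * math.log(n) - n


if __name__ == "__main__":
    for n in (2, 10, 100, 1000):
        print(n, log_lower_bound(n), log_factorial(n), log_upper_bound(n))
```

The gap between $\ln(n!)$ and the upper bound stays small as n grows, which is why the text says the two bounds closely match.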


Theorem 5.7 holds for any nonnegative function on the number of balls in the bins. In 
particular, if the function is the indicator function that is 1 if some event occurs and 0 
otherwise, then the theorem gives bounds on the probability of events. Let us call the 
scenario in which the number of balls in the bins are taken to be independent Poisson 
random variables with mean m/n the Poisson case, and the scenario where m balls are
thrown into n bins independently and uniformly at random the exact case. 


Corollary 5.9: Any event that takes place with probability p in the Poisson case takes
place with probability at most $pe\sqrt{m}$ in the exact case.

Proof: Let f be the indicator function of the event. In this case, $\mathbf{E}[f]$ is just the probability
that the event occurs, and the result follows immediately from Theorem 5.7. ■


This is a quite powerful result. It says that any event that happens with small proba¬ 
bility in the Poisson case also happens with small probability in the exact case, where 
balls are thrown into bins. Since in the analysis of algorithms we often want to show 
that certain events happen with small probability, this result says that we can utilize an 
analysis of the Poisson approximation to obtain a bound for the exact case. The Poisson
approximation is easier to analyze because the numbers of balls in each bin are
independent random variables.¹

We can actually do even a little bit better in many natural cases. The proof of the 
following theorem is outlined in Exercise 5.14. 

Theorem 5.10: Let $f(x_1, \ldots, x_n)$ be a nonnegative function such that
$\mathbf{E}[f(X_1^{(m)}, \ldots, X_n^{(m)})]$ is either monotonically increasing or monotonically decreasing in m. Then

\[ \mathbf{E}\left[f(X_1^{(m)}, \ldots, X_n^{(m)})\right] \le 2\,\mathbf{E}\left[f(Y_1^{(m)}, \ldots, Y_n^{(m)})\right]. \tag{5.6} \]

The following corollary is immediate. 

Corollary 5.11: Let $\mathcal{E}$ be an event whose probability is either monotonically increasing
or monotonically decreasing in the number of balls. If $\mathcal{E}$ has probability p in the
Poisson case, then $\mathcal{E}$ has probability at most 2p in the exact case.
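As an illustration of Corollaries 5.9 and 5.11, consider the event that no bin is empty, whose probability increases with the number of balls. The Python sketch below computes its probability exactly (by inclusion-exclusion over the set of empty bins) and in the Poisson case, so the two bounds can be compared. The choice of event and the function names are assumptions made for this example.

```python
import math


def prob_no_empty_exact(m: int, n: int) -> float:
    """Exact case: Pr(no bin is empty) when m balls are thrown into n bins."""
    return sum(
        (-1) ** i * math.comb(n, i) * (1.0 - i / n) ** m
        for i in range(n + 1)
    )


def prob_no_empty_poisson(m: int, n: int) -> float:
    """Poisson case: n independent Poisson(m/n) loads, all of them nonzero."""
    return (1.0 - math.exp(-m / n)) ** n


if __name__ == "__main__":
    m, n = 30, 10
    exact = prob_no_empty_exact(m, n)
    poisson = prob_no_empty_poisson(m, n)
    print(exact, poisson)
    print("Corollary 5.9 bound: ", math.e * math.sqrt(m) * poisson)
    print("Corollary 5.11 bound:", 2.0 * poisson)
```

For this event the Poisson value is already close to the exact one, so both bounds hold with considerable room to spare; the factor-2 bound is the sharper of the two.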


To demonstrate the utility of this corollary, we again consider the maximum load problem
for the case m = n. We have shown via a union bound argument that the maximum
load is at most $3 \ln n/\ln\ln n$ with high probability. Using the Poisson approximation,
we prove the following almost-matching lower bound on the maximum load.


Lemma 5.12: When n balls are thrown independently and uniformly at random into 
n bins, the maximum load is at least \nn/\n\nn with probability at least 1 — 1 /n for n 
sufficiently large. 

Proof: In the Poisson case, the probability that bin 1 has load at least $M = \ln n/\ln\ln n$
is at least $1/(eM!)$, which is the probability it has load exactly M. In the Poisson case,
all bins are independent, so the probability that no bin has load at least M is at most

$$\left(1 - \frac{1}{eM!}\right)^{n} \le e^{-n/(eM!)}.$$

We now need to choose M so that $e^{-n/(eM!)} \le n^{-2}$, for then (by Theorem 5.7) we will
have that the probability that the maximum load is not at least M in the exact case is at
most $e\sqrt{n}/n^2 < 1/n$. This will give the lemma. Because the maximum load is clearly
monotonically increasing in the number of balls, we could also apply the slightly better
Theorem 5.10, but this would not affect the argument substantially.

It therefore suffices to show that $M! \le n/(2e\ln n)$, or equivalently that
$\ln M! \le \ln n - \ln\ln n - \ln(2e)$. From our bound (5.5), it follows that

$$M! \le e\sqrt{M}\left(\frac{M}{e}\right)^{M} \le M\left(\frac{M}{e}\right)^{M}$$

when n (and hence $M = \ln n/\ln\ln n$) are suitably large. Hence, for n suitably large,

$$\begin{aligned}
\ln M! &\le M\ln M - M + \ln M \\
&= \frac{\ln n}{\ln\ln n}(\ln\ln n - \ln\ln\ln n) - \frac{\ln n}{\ln\ln n} + (\ln\ln n - \ln\ln\ln n) \\
&\le \ln n - \frac{\ln n}{\ln\ln n} \\
&\le \ln n - \ln\ln n - \ln(2e),
\end{aligned}$$

where in the last two inequalities we have used the fact that $\ln\ln n = o(\ln n/\ln\ln n)$.

■

¹ There are other ways to handle the dependencies in the balls-and-bins model. In Chapter 12 we describe a more
general way to deal with dependencies (using martingales) that applies here. Also, there is a theory of negative
dependence that applies to balls-and-bins problems that also allows these dependencies to be dealt with nicely.
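As an empirical companion to Lemma 5.12, the following Python sketch throws n balls into n bins and compares the observed maximum load with $\ln n/\ln\ln n$ and with the upper bound $3\ln n/\ln\ln n$. The function names, the seed, and the trial counts are illustrative choices of ours rather than anything fixed by the text, and since the lemma is asserted only for n sufficiently large, small runs are suggestive rather than conclusive.

```python
import math
import random
from collections import Counter


def max_load(n, rng):
    """Throw n balls into n bins uniformly at random; return the fullest bin's load."""
    counts = Counter(rng.randrange(n) for _ in range(n))
    return max(counts.values())


def bounds(n):
    """Return (ln n / ln ln n, 3 ln n / ln ln n); requires n >= 3."""
    base = math.log(n) / math.log(math.log(n))
    return base, 3 * base


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary fixed seed so runs are repeatable
    for n in (1_000, 10_000):
        lower, upper = bounds(n)
        loads = [max_load(n, rng) for _ in range(10)]
        print(n, round(lower, 2), min(loads), max(loads), round(upper, 2))
```

Each printed row shows n, the lower bound, the smallest and largest maximum load seen over the trials, and the upper bound.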

5.4.1.* Example: Coupon Collector’s Problem, Revisited 

The coupon collector’s problem introduced in Section 2.4.1 can be thought of as a
balls-and-bins problem. Recall that in this problem there are n different types of coupons,
each cereal box yields a coupon chosen independently and uniformly at random from
the n types, and you need to buy cereal boxes until you collect one of each coupon. If
we think of coupons as bins and cereal boxes as balls, the question becomes: If balls are
thrown independently and uniformly at random into bins, how many balls are thrown
until all bins have at least one ball? We showed in Section 2.4.1 that the expected number
of cereal boxes necessary is $nH(n) \approx n\ln n$; in Section 3.3.1 we showed that, if
there are $n\ln n + cn$ cereal boxes, then the probability that not all coupons are collected
is at most $e^{-c}$. These results translate immediately to the balls-and-bins setting. The
expected number of balls that must be thrown before each bin has at least one ball is
$nH(n)$, and when $n\ln n + cn$ balls are thrown the probability that not all bins have at
least one ball is at most $e^{-c}$.

We have seen in Chapter 4 that Chernoff bounds yield concentration results for sums
of independent 0-1 random variables. We will use here a Chernoff bound for the Poisson
distribution to obtain much stronger results for the coupon collector’s problem.


Theorem 5.13: Let X be the number of coupons observed before obtaining one of each
of n types of coupons. Then, for any constant c,

$$\lim_{n\to\infty} \Pr[X > n\ln n + cn] = 1 - e^{-e^{-c}}.$$

This theorem states that, for large n, the number of coupons required should be very
close to $n\ln n$. For example, over 98% of the time the number of coupons required lies
between $n\ln n - 4n$ and $n\ln n + 4n$. This is an example of a sharp threshold, where
the random variable is closely concentrated around its mean.
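The 98% figure can be checked directly from the limiting expression in Theorem 5.13. The sketch below evaluates the limit at $c = 4$ and $c = -4$; the helper names `limit_tail` and `limit_window` are ours, and the values are limits as n grows, not exact probabilities for any finite n.

```python
import math


def limit_tail(c):
    """Limiting value of Pr[X > n ln n + c n] from Theorem 5.13."""
    return 1.0 - math.exp(-math.exp(-c))


def limit_window(c):
    """Limiting probability that n ln n - c n < X <= n ln n + c n."""
    # Pr[X > n ln n - c n] minus Pr[X > n ln n + c n].
    return limit_tail(-c) - limit_tail(c)


if __name__ == "__main__":
    print(limit_tail(4.0), limit_window(4.0))
```

The upper tail at $c = 4$ is roughly $e^{-4}$, while the lower tail at $c = -4$ is $e^{-e^{4}}$, which is negligible; almost all of the missing 2% therefore sits above $n\ln n + 4n$.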

Proof: We look at the problem as a balls-and-bins problem. We begin by considering
the Poisson approximation, and then demonstrate that the Poisson approximation
gives the correct answer in the limit. For the Poisson approximation, we suppose that
the number of balls in each bin is a Poisson random variable with mean $\ln n + c$, so that
the expected total number of balls is $m = n\ln n + cn$. The probability that a specific
bin is empty is then

$$e^{-(\ln n + c)} = \frac{e^{-c}}{n}.$$

Since all bins are independent under the Poisson approximation, the probability that
no bin is empty is

$$\left(1 - \frac{e^{-c}}{n}\right)^{n} \approx e^{-e^{-c}}.$$

The last approximation is appropriate in the limit as n grows large, so we apply it here.

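To show the Poisson approximation is accurate, we undertake the following steps.
Consider the experiment where each bin has a Poisson number of balls, each with mean
$\ln n + c$. Let $\mathcal{E}$ be the event that no bin is empty, and let X be the number of balls thrown.
We have seen that

$$\lim_{n\to\infty} \Pr(\mathcal{E}) = e^{-e^{-c}}.$$

We analyze $\Pr(\mathcal{E})$ by splitting it as follows:

$$\begin{aligned}
\Pr(\mathcal{E}) &= \Pr\bigl(\mathcal{E} \cap (|X - m| \le \sqrt{2m\ln m})\bigr) + \Pr\bigl(\mathcal{E} \cap (|X - m| > \sqrt{2m\ln m})\bigr) \\
&= \Pr\bigl(\mathcal{E} \mid |X - m| \le \sqrt{2m\ln m}\bigr) \cdot \Pr\bigl(|X - m| \le \sqrt{2m\ln m}\bigr) \\
&\quad + \Pr\bigl(\mathcal{E} \mid |X - m| > \sqrt{2m\ln m}\bigr) \cdot \Pr\bigl(|X - m| > \sqrt{2m\ln m}\bigr).
\end{aligned} \tag{5.7}$$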

This representation proves helpful once we establish two facts. First, we show that
$\Pr(|X - m| > \sqrt{2m\ln m})$ is $o(1)$; that is, the probability that in the Poisson case the
number of balls thrown deviates significantly from its mean m is $o(1)$. This guarantees
that the second term in the summation on the right of Eqn. (5.7) is $o(1)$. Second, we
show that

$$\bigl|\Pr\bigl(\mathcal{E} \mid |X - m| \le \sqrt{2m\ln m}\bigr) - \Pr(\mathcal{E} \mid X = m)\bigr| = o(1).$$

That is, the difference between our experiment coming up with exactly m balls or just
almost m balls makes an asymptotically negligible difference in the probability that
every bin has a ball. With these two facts, Eqn. (5.7) becomes

$$\begin{aligned}
\Pr(\mathcal{E}) &= \Pr\bigl(\mathcal{E} \mid |X - m| \le \sqrt{2m\ln m}\bigr) \cdot \Pr\bigl(|X - m| \le \sqrt{2m\ln m}\bigr) \\
&\quad + \Pr\bigl(\mathcal{E} \mid |X - m| > \sqrt{2m\ln m}\bigr) \cdot \Pr\bigl(|X - m| > \sqrt{2m\ln m}\bigr) \\
&= \Pr\bigl(\mathcal{E} \mid |X - m| \le \sqrt{2m\ln m}\bigr) \cdot (1 - o(1)) + o(1) \\
&= \Pr(\mathcal{E} \mid X = m)(1 - o(1)) + o(1),
\end{aligned}$$

and hence

$$\lim_{n\to\infty} \Pr(\mathcal{E}) = \lim_{n\to\infty} \Pr(\mathcal{E} \mid X = m).$$

But from Theorem 5.6, the quantity on the right is equal to the probability that every 
bin has at least one ball when m balls are thrown randomly, since conditioning on m 
total balls with the Poisson approximation is equivalent to throwing m balls randomly 
into the n bins. As a result, the theorem follows once we have shown these two facts. 

To show that $\Pr(|X - m| > \sqrt{2m\ln m})$ is $o(1)$, consider that X is a Poisson random
variable with mean m, since it is a sum of independent Poisson random variables.
We use the Chernoff bound for the Poisson distribution (Theorem 5.4) to bound this
probability, writing the bound as

$$\Pr(X \ge x) \le e^{x - m - x\ln(x/m)}.$$

For $x = m + \sqrt{2m\ln m}$, we use that $\ln(1 + z) \ge z - z^2/2$ for $z \ge 0$ to show

$$\begin{aligned}
\Pr\bigl(X > m + \sqrt{2m\ln m}\bigr) &\le e^{\sqrt{2m\ln m} - (m + \sqrt{2m\ln m})\ln(1 + \sqrt{2\ln m/m})} \\
&\le e^{\sqrt{2m\ln m} - (m + \sqrt{2m\ln m})(\sqrt{2\ln m/m} - \ln m/m)} \\
&= e^{-\ln m + \sqrt{2m\ln m}\,(\ln m/m)} = o(1).
\end{aligned}$$

A similar argument holds if $x < m$, so $\Pr(|X - m| > \sqrt{2m\ln m}) = o(1)$.

We now show the second fact, that

$$\bigl|\Pr\bigl(\mathcal{E} \mid |X - m| \le \sqrt{2m\ln m}\bigr) - \Pr(\mathcal{E} \mid X = m)\bigr| = o(1).$$

Note that $\Pr(\mathcal{E} \mid X = k)$ is increasing in k, since this probability corresponds to the
probability that all bins are nonempty when k balls are thrown independently and uniformly
at random. The more balls that are thrown, the more likely all bins are nonempty.
It follows that

$$\Pr\bigl(\mathcal{E} \mid X = m - \sqrt{2m\ln m}\bigr) \le \Pr\bigl(\mathcal{E} \mid |X - m| \le \sqrt{2m\ln m}\bigr) \le \Pr\bigl(\mathcal{E} \mid X = m + \sqrt{2m\ln m}\bigr).$$

Hence we have the bound

$$\bigl|\Pr\bigl(\mathcal{E} \mid |X - m| \le \sqrt{2m\ln m}\bigr) - \Pr(\mathcal{E} \mid X = m)\bigr| \le \Pr\bigl(\mathcal{E} \mid X = m + \sqrt{2m\ln m}\bigr) - \Pr\bigl(\mathcal{E} \mid X = m - \sqrt{2m\ln m}\bigr),$$

and we show the right-hand side is $o(1)$. This is the difference between the probability
that all bins receive at least one ball when $m - \sqrt{2m\ln m}$ balls are thrown and when
$m + \sqrt{2m\ln m}$ balls are thrown. This difference is equivalent to the probability of
the following experiment: we throw $m - \sqrt{2m\ln m}$ balls and there is still at least one
empty bin, but after throwing an additional $2\sqrt{2m\ln m}$ balls, all bins are nonempty.
In order for this to happen, there must be at least one empty bin after $m - \sqrt{2m\ln m}$
balls; the probability that one of the next $2\sqrt{2m\ln m}$ balls covers this bin is at most
$(2\sqrt{2m\ln m})/n = o(1)$ by the union bound. Hence this difference is $o(1)$ as well. ■

5.5. Application: Hashing 

5.5.1. Chain Hashing 

The balls-and-bins model is also useful for modeling hashing. For example, consider
the application of a password checker, which prevents people from using common, easily
cracked passwords by keeping a dictionary of unacceptable passwords. When a user
tries to set up a password, the application would like to check if the requested password
is part of the unacceptable set. One possible approach for a password checker
would be to store the unacceptable passwords alphabetically and do a binary search on
the dictionary to check if a proposed password is unacceptable. A binary search would
require $\Theta(\log m)$ time for m words.

Another possibility is to place the words into bins and then search the appropriate
bin for the word. The words in a bin would be represented by a linked list. The placement
of words into bins is accomplished by using a hash function. A hash function f
from a universe U into a range [0, n - 1] can be thought of as a way of placing items
from the universe into n bins. Here the universe U would consist of possible password
strings. The collection of bins is called a hash table. This approach to hashing is called
chain hashing, since items that fall in the same bin are chained together in a linked list.

Using a hash table turns the dictionary problem into a balls-and-bins problem. If our
dictionary of unacceptable passwords consists of m words and the range of the hash
function is [0, n - 1], then we can model the distribution of words in bins with the same
distribution as m balls placed randomly in n bins. We are making a rather strong
assumption by presuming that our hash function maps words into bins in a fashion that
appears random, so that the location of each word is independent and identically
distributed. There is a great deal of theory behind designing hash functions that appear
random, and we will not delve into that theory here. We simply model the problem by
assuming that hash functions are random. In other words, we assume that (a) for each
$x \in U$, the probability that $f(x) = j$ is $1/n$ (for $0 \le j \le n - 1$) and that (b) the values
of f(x) for each x are independent of each other. Notice that this does not mean that
every evaluation of f(x) yields a different random answer! The value of f(x) is fixed
for all time; it is just equally likely to take on any value in the range.

Let us consider the search time when there are n bins and m words. To search for an
item, we first hash it to find the bin that it lies in and then search sequentially through the
linked list for it. If we search for a word that is not in our dictionary, the expected number
of words in the bin the word hashes to is m/n. If we search for a word that is in our
dictionary, the expected number of other words in that word’s bin is (m - 1)/n, so the
expected number of words in the bin is 1 + (m - 1)/n. If we choose n = m bins for our hash
table, then the expected number of words we must search through in a bin is constant. If
the hashing takes constant time, then the total expected time for the search is constant.
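A minimal sketch of chain hashing in Python may make the search procedure concrete. Two assumptions are ours and not part of the text: SHA-256 (reduced modulo n) stands in for the idealized random hash function f, and each chain is a Python list rather than a linked list. The class and method names are hypothetical.

```python
import hashlib


class ChainHashTable:
    """Chain hashing: n bins, each holding the words that hash to it."""

    def __init__(self, n_bins):
        self.n = n_bins
        self.bins = [[] for _ in range(n_bins)]

    def _bin(self, word):
        # Assumption: SHA-256 approximates a random function into [0, n - 1].
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.n

    def add(self, word):
        chain = self.bins[self._bin(word)]
        if word not in chain:
            chain.append(word)

    def __contains__(self, word):
        # Hash to find the bin, then scan that bin's chain sequentially.
        return word in self.bins[self._bin(word)]

    def max_load(self):
        """Largest number of words in any bin (the worst-case scan length)."""
        return max(len(chain) for chain in self.bins)
```

With m words and n = m bins, a lookup scans one chain whose expected length is constant, while `max_load` reports the length of the longest chain that a search could have to scan.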

The maximum time to search for a word, however, is proportional to the maximum
number of words in a bin. We have shown that when n = m this maximum load is
$\Theta(\ln n/\ln\ln n)$ with probability close to 1, and hence with high probability this is the
maximum search time in such a hash table. While this is still faster than the required
time for standard binary search, it is much slower than the average, which can be a
drawback for many applications.

Another drawback of chain hashing can be wasted space. If we use n bins for n 
items, several of the bins will be empty, potentially leading to wasted space. The space 
wasted can be traded off against the search time by making the average number of 
words per bin larger than 1. 


5.5.2. Hashing: Bit Strings 

If we want to save space instead of time, we can use hashing in another way. Again,
we consider the problem of keeping a dictionary of unsuitable passwords. Assume that
a password is restricted to be eight ASCII characters, which requires 64 bits (8 bytes)
to represent. Suppose we use a hash function to map each word into a 32-bit string.
This string will serve as a short fingerprint for the word; just as a fingerprint is a succinct
way of identifying people, the fingerprint string is a succinct way of identifying
a word. We keep the fingerprints in a sorted list. To check if a proposed password is
unacceptable, we calculate its fingerprint and look for it on the list, say by a binary
search.² If the fingerprint is on the list, we declare the password unacceptable.

In this case, our password checker may not give the correct answer! It is possible
for a user to input an acceptable password, only to have it rejected because its fingerprint
matches the fingerprint of an unacceptable password. Hence there is some chance
that hashing will yield a false positive: it may falsely declare a match when there is
not an actual match. The problem is that - unlike fingerprints for human beings - our
fingerprints do not uniquely identify the associated word. This is the only type of mistake
this algorithm can make; it does not allow a password that is in the dictionary of
unsuitable passwords. In the password application, allowing false positives means our
algorithm is overly conservative, which is probably acceptable. Letting easily cracked
passwords through, however, would probably not be acceptable.
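The fingerprint scheme can be sketched in a few lines of Python. As an assumption of ours, the leading bits of SHA-256 stand in for the random hash function, and the names `fingerprint` and `FingerprintChecker` are hypothetical. The sketch keeps a sorted list of fingerprints and looks one up by binary search, so it can reject an acceptable password (a false positive) but never accepts a word from the dictionary.

```python
import bisect
import hashlib


def fingerprint(word, bits=32):
    """Map a word to a `bits`-bit integer (1 <= bits <= 256)."""
    # Assumption: the top bits of SHA-256 behave like a random hash.
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


class FingerprintChecker:
    def __init__(self, bad_words, bits=32):
        self.bits = bits
        # Only fingerprints are stored, in sorted order, not the words.
        self.prints = sorted({fingerprint(w, bits) for w in bad_words})

    def is_rejected(self, word):
        """True if the word's fingerprint is on the list (possibly a false positive)."""
        fp = fingerprint(word, self.bits)
        i = bisect.bisect_left(self.prints, fp)  # binary search
        return i < len(self.prints) and self.prints[i] == fp
```

Shrinking `bits` makes false positives easy to observe, which illustrates why the choice of fingerprint length matters.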

To place the problem in a more general context, we describe it as an approximate
set membership problem. Suppose we have a set $S = \{s_1, s_2, \ldots, s_m\}$ of m elements
from a large universe U. We would like to represent the elements in such a way that we
can quickly answer queries of the form “Is x an element of S?” We would also like the
representation to take as little space as possible. In order to save space, we would be
willing to allow occasional mistakes in the form of false positives. Here the unallowable
passwords correspond to our set S.

How large should the range of the hash function used to create the fingerprints be?
Specifically, if we are working with bits, how many bits should we use to create a
fingerprint? Obviously, we want to choose the number of bits that gives an acceptable
probability for a false positive match. The probability that an acceptable password has a
fingerprint that is different from any specific unallowable password in S is $(1 - 1/2^b)$. It
follows that if the set S has size m and if we use b bits for the fingerprint, then the
probability of a false positive for an acceptable password is $1 - (1 - 1/2^b)^m \ge 1 - e^{-m/2^b}$.
If we want this probability of a false positive to be less than a constant c, we need

$$e^{-m/2^b} \ge 1 - c,$$

which implies that

$$b \ge \log_2 \frac{m}{\ln(1/(1 - c))}.$$

That is, we need $b = \Omega(\log_2 m)$ bits. On the other hand, if we use $b = 2\log_2 m$ bits,
then the probability of a false positive falls to

$$1 - \left(1 - \frac{1}{m^2}\right)^{m} < \frac{1}{m}.$$

In our example, if our dictionary has $2^{16} = 65{,}536$ words, then using 32 bits when
hashing yields a false positive probability of just less than 1/65,536.

² In this case the fingerprints will be uniformly distributed over all 32-bit strings. There are faster algorithms
for searching over sets of numbers with this distribution, just as Bucket sort allows faster sorting than standard
comparison-based sorting when the elements to be sorted are from a uniform distribution, but we will not
concern ourselves with this point here.

Figure 5.1: Example of how a Bloom filter functions.

5.5.3. Bloom Filters 

We can generalize the hashing ideas of Sections 5.5.1 and 5.5.2 to achieve more interesting
trade-offs between the space required and the false positive probability. The
resulting data structure for the approximate set membership problem is called a Bloom
filter.

A Bloom filter consists of an array of n bits, A[0] to A[n - 1], initially all set
to 0. A Bloom filter uses k independent random hash functions $h_1, \ldots, h_k$ with range
$\{0, \ldots, n - 1\}$. We make the usual assumption for analysis that these hash functions map
each element in the universe to a random number uniformly over the range $\{0, \ldots, n - 1\}$.
Suppose that we use a Bloom filter to represent a set $S = \{s_1, s_2, \ldots, s_m\}$ of m elements
from a large universe U. For each element $s \in S$, the bits $A[h_i(s)]$ are set to 1 for $1 \le i \le k$.
A bit location can be set to 1 multiple times, but only the first change has an effect.
To check if an element x is in S, we check whether all array locations $A[h_i(x)]$ for
$1 \le i \le k$ are set to 1. If not, then clearly x is not a member of S, because if x were in S then
all locations $A[h_i(x)]$ for $1 \le i \le k$ would be set to 1 by construction. If all $A[h_i(x)]$
are set to 1, we assume that x is in S, although we could be wrong. We would be wrong
if all of the positions $A[h_i(x)]$ were set to 1 by elements of S even though x is not in
the set. Hence Bloom filters may yield false positives. Figure 5.1 shows an example.

The probability of a false positive for an element not in the set - the false positive
probability - can be calculated in a straightforward fashion, given our assumption that
the hash functions are random. After all the elements of S are hashed into the Bloom
filter, the probability that a specific bit is still 0 is

$$\left(1 - \frac{1}{n}\right)^{km} \approx e^{-km/n}.$$

We let $p = e^{-km/n}$. To simplify the analysis, let us temporarily assume that a fraction
p of the entries are still 0 after all of the elements of S are hashed into the Bloom filter.
The probability of a false positive is then

$$\left(1 - \left(1 - \frac{1}{n}\right)^{km}\right)^{k} \approx \left(1 - e^{-km/n}\right)^{k} = (1 - p)^{k}.$$

We let $f = (1 - e^{-km/n})^k = (1 - p)^k$. From now on, for convenience we use the asymptotic
approximations p and f to represent (respectively) the probability that a bit in the
Bloom filter is 0 and the probability of a false positive.
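The insertion and lookup rules of a Bloom filter can be sketched as follows. Several choices here are assumptions of ours rather than part of the definition: the k hash functions are simulated by salting SHA-256 with the index i, the array stores one byte per bit for readability, and the class name `BloomFilter` and its methods are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = bytearray(n_bits)  # A[0..n-1], all 0 initially

    def _positions(self, item):
        # Assumption: salting SHA-256 with i approximates k independent
        # random hash functions h_1, ..., h_k with range {0, ..., n - 1}.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1  # setting a bit twice has no further effect

    def __contains__(self, item):
        # Report membership only if every one of the k positions is set.
        return all(self.bits[pos] for pos in self._positions(item))
```

Every inserted element is always reported as present, so the only possible error is a false positive, in line with the discussion above.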

Suppose that we are given m and n and wish to optimize the number of hash functions
k in order to minimize the false positive probability f. There are two competing
forces: using more hash functions gives us more chances to find a 0-bit for an element
that is not a member of S, but using fewer hash functions increases the fraction of 0-bits
in the array. The optimal number of hash functions that minimizes f as a function of
k is easily found by taking the derivative. Let $g = k\ln(1 - e^{-km/n})$, so that $f = e^g$ and
minimizing the false positive probability f is equivalent to minimizing g with respect
to k. We find

$$\frac{dg}{dk} = \ln\left(1 - e^{-km/n}\right) + \frac{km}{n}\,\frac{e^{-km/n}}{1 - e^{-km/n}}.$$

It is easy to check that the derivative is zero when $k = (\ln 2)\cdot(n/m)$ and that this
point is a global minimum. In this case the false positive probability f is
$(1/2)^k \approx (0.6185)^{n/m}$. The false positive probability falls exponentially in n/m, the number of
bits used per item. In practice, of course, k must be an integer, so the best possible
choice of k may lead to a slightly higher false positive rate.
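The optimization is simple to carry out numerically. The sketch below uses the asymptotic formula $f = (1 - e^{-km/n})^k$ and, as an assumption consistent with f having a single stationary point, compares only the two integers on either side of the real-valued optimum. The function names and the value n/m = 8 in the demo are illustrative.

```python
import math


def false_positive(k, bits_per_item):
    """Asymptotic f = (1 - e^{-k m / n})^k, where bits_per_item = n / m."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


def optimal_k(bits_per_item):
    """Real-valued minimizer k = (ln 2)(n / m)."""
    return math.log(2) * bits_per_item


def best_integer_k(bits_per_item):
    """Better of the two integers adjacent to the real optimum (at least 1)."""
    k_star = optimal_k(bits_per_item)
    candidates = {max(1, math.floor(k_star)), max(1, math.ceil(k_star))}
    return min(candidates, key=lambda k: false_positive(k, bits_per_item))


if __name__ == "__main__":
    c = 8  # bits per item, chosen for illustration
    k = best_integer_k(c)
    print(optimal_k(c), k, false_positive(k, c))
```

At the real-valued optimum the exponent satisfies $e^{-km/n} = 1/2$, which is why `false_positive(optimal_k(c), c)` agrees with $(1/2)^k$.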

A Bloom filter is like a hash table, but instead of storing set items we simply use one
bit to keep track of whether or not an item hashed to that location. If k = 1, we have
just one hash function and the Bloom filter is equivalent to a hashing-based fingerprint
system, where the list of the fingerprints is stored in a 0-1 bit array. Thus Bloom filters
can be seen as a generalization of the idea of hashing-based fingerprints. As we
saw when using fingerprints, to get even a small constant probability of a false positive
required $\Omega(\log m)$ fingerprint bits per item. In many practical applications, $\Omega(\log m)$
bits per item can be too many. Bloom filters allow a constant probability of a false positive
while keeping n/m, the number of bits of storage required per item, constant. For
many applications, the small space requirements make a constant probability of error
acceptable. For example, in the password application, we may be willing to accept
false positive rates of 1% or 2%.

Bloom filters are highly effective even if n = cm for a small constant c, such as
c = 8. In this case, when k = 5 or k = 6 the false positive probability is just over 0.02.
This contrasts with the approach of hashing each element into $\Theta(\log m)$ bits. Bloom
filters require significantly fewer bits while still achieving a very good false positive
probability.

It is also interesting to frame the optimization another way. Consider f, the probability
of a false positive, as a function of p. We find

$$\begin{aligned}
f &= (1 - p)^{k} \\
&= (1 - p)^{(-\ln p)(n/m)} \\
&= \left(e^{-\ln(p)\ln(1 - p)}\right)^{n/m}.
\end{aligned} \tag{5.8}$$

From the symmetry of this expression, it is easy to check that p = 1/2 minimizes the
false positive probability f. Hence the optimal results are achieved when each bit of the
Bloom filter is 0 with probability 1/2. An optimized Bloom filter looks like a random
bit string.

To conclude, we reconsider our assumption that the fraction of entries that are still 0
after all of the elements of S are hashed into the Bloom filter is p. Each bit in the array
can be thought of as a bin, and hashing an item is like throwing a ball. The fraction of
entries that are still 0 after all of the elements of S are hashed is therefore equivalent to
the fraction of empty bins after mk balls are thrown into n bins. Let X be the number
of such bins when mk balls are thrown. The expected fraction of such bins is

$$p' = \left(1 - \frac{1}{n}\right)^{km}.$$

The events of different bins being empty are not independent, but we can apply
Corollary 5.9, along with the Chernoff bound of Eqn. (4.6), to obtain

$$\Pr(|X - np'| \ge \varepsilon n) \le 2e\sqrt{n}\,e^{-n\varepsilon^{2}/3p'}.$$

Actually, Corollary 5.11 applies as well, since the number of 0-entries - which corresponds
to the number of empty bins - is monotonically decreasing in the number of
balls thrown. The bound tells us that the fraction of empty bins is close to p' (when
n is reasonably large) and that p' is very close to p. Our assumption that the fraction
of 0-entries in the Bloom filter is p is therefore quite accurate for predicting actual
performance.
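A short simulation gives a feel for how close the fraction of 0-entries is to p. The sketch below throws mk balls into n bins and compares the observed empty fraction with $p' = (1 - 1/n)^{km}$ and $p = e^{-km/n}$; the function names, the seed, and the parameter values in the demo are our own illustrative choices.

```python
import math
import random


def zero_fraction(n, m, k, rng):
    """Throw m*k balls into n bins; return the fraction of bins left empty."""
    hit = bytearray(n)
    for _ in range(m * k):
        hit[rng.randrange(n)] = 1
    return 1.0 - sum(hit) / n


def predicted(n, m, k):
    """Return (p', p) = ((1 - 1/n)^{km}, e^{-km/n})."""
    return (1.0 - 1.0 / n) ** (k * m), math.exp(-k * m / n)


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary fixed seed
    n, m, k = 8000, 1000, 5
    print(zero_fraction(n, m, k, rng), predicted(n, m, k))
```

The simulation models idealized random hashing directly with a random number generator, so it checks the balls-and-bins claim rather than the quality of any particular hash function.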

5.5.4. Breaking Symmetry 

As our last application of hashing, we consider how hashing provides a simple way 
to break symmetry. Suppose that n users want to utilize a resource, such as time on a 
supercomputer. They must use the resource sequentially, one at a time. Of course, each 
user wants to be scheduled as early as possible. How can we decide a permutation of 
the users quickly and fairly? 

If each user has an identifying name or number, hashing provides one possible solution.
Hash each user’s identifier into b bits (that is, into one of $2^b$ possible values), and then
take the permutation given by
the sorted order of the resulting numbers. That is, the user whose identifier gives the
smallest number when hashed comes first, and so on. For this approach to work, we
do not want two users to hash to the same value, since then we must decide again how
to order these users.

If b is sufficiently large, then with high probability the users will all obtain distinct
hash values. One can analyze the probability that two hash values collide by using the
analysis from Section 5.1 for the birthday paradox; hash values correspond to birthdays.
We here use a simpler analysis similar to that used for analyzing fingerprints in
Section 5.5.2. Consider the point of view of one user. The probability that some other
user obtains the same hash value is

$$1 - \left(1 - \frac{1}{2^b}\right)^{n-1} \le \frac{n - 1}{2^b}.$$

By the union bound, the probability that any user has the same hash value as another is
at most $n(n - 1)/2^b$. Hence, choosing $b = 3\log_2 n$ guarantees success with probability
at least $1 - 1/n$.
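The scheduling scheme can be sketched as follows. As assumptions of ours, SHA-256 truncated to b bits plays the role of the random hash function, b is rounded up to $\lceil 3\log_2 n \rceil$, and a collision is reported by returning `None` so that the caller can resolve it separately. All names are hypothetical.

```python
import hashlib
import math


def hash_bits(identifier, b):
    """Hash an identifier to a b-bit integer (1 <= b <= 256)."""
    # Assumption: the top b bits of SHA-256 behave like a random hash.
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (256 - b)


def collision_bound(n, b):
    """Union bound n(n - 1) / 2^b on the probability of any collision."""
    return n * (n - 1) / 2 ** b


def schedule(users):
    """Order users by hash value; return None if two hash values collide."""
    n = len(users)
    b = max(1, math.ceil(3 * math.log2(n))) if n > 1 else 1
    keyed = sorted((hash_bits(u, b), u) for u in users)
    if len({h for h, _ in keyed}) < n:
        return None  # a tie must be broken by some other means
    return [u for _, u in keyed]
```

Because the hash is a fixed function of the identifier, every participant who runs `schedule` on the same list of users computes the same order without further communication.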

This solution is extremely flexible, making it useful for many situations in distributed
computing. For example, new users can easily be added into the schedule at any
time, as long as they do not hash to the same number as another scheduled user.

A related problem is leader election. Suppose that instead of trying to order all of
the users, we simply want to fairly choose a leader from them. Again, if we have a suitably
random hash function then we can simply take the user whose hash value is the
smallest. An analysis of this scheme is left as Exercise 5.25.


5.6. Random Graphs 


5.6.1. Random Graph Models 

There are many NP-hard computational problems defined on graphs: Hamiltonian cycle, 
independent set, vertex cover, and so forth. One question worth asking is whether these 
problems are hard for most inputs or just for a relatively small fraction of all graphs. 
Random graph models provide a probabilistic setting for studying such questions. 

Most of the work on random graphs has focused on two closely related models, $G_{n,p}$
and $G_{n,N}$. In $G_{n,p}$ we consider all undirected graphs on n distinct vertices $v_1, v_2, \ldots, v_n$.
A graph with a given set of m edges has probability

$$p^{m}(1 - p)^{\binom{n}{2} - m}.$$

One way to generate a random graph in $G_{n,p}$ is to consider each of the $\binom{n}{2}$ possible
edges in some order and then independently add each edge to the graph with probability
p. The expected number of edges in the graph is therefore $\binom{n}{2}p$, and each vertex
has expected degree $(n - 1)p$.

In the $G_{n,N}$ model, we consider all undirected graphs on n vertices with exactly N
edges. There are $\binom{\binom{n}{2}}{N}$ possible graphs, each selected with equal probability. One way
to generate a graph uniformly from the graphs in $G_{n,N}$ is to start with a graph with no
edges. Choose one of the $\binom{n}{2}$ possible edges uniformly at random and add it to the edges
in the graph. Now choose one of the remaining $\binom{n}{2} - 1$ possible edges independently
and uniformly at random and add it to the graph. Similarly, continue choosing one of
the remaining unchosen edges independently and uniformly at random until there are
N edges.
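Both generation procedures translate directly into code. In the sketch below, vertices are numbered 0 to n - 1 and an edge is a pair (u, v) with u < v; these conventions and the function names are ours. For $G_{n,N}$ we use `random.Random.sample`, which draws N distinct edges without replacement and so matches the sequential procedure described above.

```python
import itertools
import random


def sample_gnp(n, p, rng):
    """G_{n,p}: keep each of the C(n, 2) possible edges independently with probability p."""
    return [e for e in itertools.combinations(range(n), 2) if rng.random() < p]


def sample_gnN(n, N, rng):
    """G_{n,N}: choose N distinct edges uniformly at random."""
    all_edges = list(itertools.combinations(range(n), 2))
    return rng.sample(all_edges, N)


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary fixed seed
    print(len(sample_gnp(50, 0.1, rng)), len(sample_gnN(50, 120, rng)))
```

Listing all $\binom{n}{2}$ edges is fine for small n; for large sparse graphs one would sample edges without materializing the full list.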

The $G_{n,p}$ and $G_{n,N}$ models are related; when $p = N/\binom{n}{2}$, the number of edges in a
random graph in $G_{n,p}$ is concentrated around N, and conditioned on a graph from $G_{n,p}$
having N edges, that graph is uniform over all the graphs from $G_{n,N}$. The relationship
is similar to the relationship between throwing m balls into n bins and having each bin
have a Poisson distributed number of balls with mean m/n.

Indeed, there are many similarities between random graphs and the balls-and-bins
models. Throwing edges into the graph as in the $G_{n,N}$ model is like throwing balls into
bins. However, since each edge has two endpoints, each edge is like throwing two balls
at once into two different bins. The pairing defined by the edges adds a rich structure
that does not exist in the balls-and-bins model. Yet we can often utilize the relation between
the two models to simplify analysis in random graph models. For example, in the
coupon collector’s problem we found that when we throw $n\ln n + cn$ balls, the probability
that there are no empty bins converges to $e^{-e^{-c}}$ as n grows to infinity. Similarly,
we have the following theorem for random graphs, which is left as Exercise 5.19.

Theorem 5.14: Let $N = \frac{1}{2}(n\ln n + cn)$. Then the probability that there are no isolated
vertices (vertices with degree 0) in $G_{n,N}$ converges to $e^{-e^{-c}}$ as n grows to infinity.

5.6.2. Application: Hamiltonian Cycles in Random Graphs 

A Hamiltonian path in a graph is a path that traverses each vertex exactly once. A
Hamiltonian cycle is a cycle that traverses each vertex exactly once. We show an interesting
connection between random graphs and balls-and-bins problems by analyzing a
simple and efficient algorithm for finding Hamiltonian cycles in random graphs. The
algorithm is randomized, and its probabilistic analysis is over both the input distribution
and the random choices of the algorithm. Finding a Hamiltonian cycle in a graph
is an NP-hard problem. However, our analysis of this algorithm shows that finding a
Hamiltonian cycle is not hard for suitably randomly selected graphs, even though it
may be hard to solve in general.

Our algorithm will make use of a simple operation called a rotation. Let G be an
undirected graph. Suppose that

$$P = v_1, v_2, \ldots, v_k$$

is a simple path in G and that $(v_k, v_i)$ is an edge of G. Then

$$P' = v_1, v_2, \ldots, v_i, v_k, v_{k-1}, \ldots, v_{i+2}, v_{i+1}$$

is also a simple path, which we refer to as the rotation of P with the rotation edge
$(v_k, v_i)$; see Figure 5.2.
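A rotation amounts to reversing the tail of the path. In the sketch below, a path is a Python list with the head at the end, and i is the 1-based index of the vertex $v_i$; the function name `rotate` is ours, and the function assumes the caller has already verified that $(v_k, v_i)$ is an edge of G.

```python
def rotate(path, i):
    """Rotate path v_1..v_k with the edge (v_k, v_i); i is the 1-based index of v_i.

    Returns v_1, ..., v_i, v_k, v_{k-1}, ..., v_{i+1}, so the new head is v_{i+1}.
    """
    if not 1 <= i < len(path):
        raise ValueError("i must satisfy 1 <= i < k")
    # Keep v_1..v_i in place and reverse the remaining segment v_{i+1}..v_k.
    return path[:i] + path[i:][::-1]
```

For example, `rotate([1, 2, 3, 4, 5, 6], 3)` gives `[1, 2, 3, 6, 5, 4]`, and rotating with $i = k - 1$ leaves the path unchanged.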

Figure 5.2: The rotation of the path $v_1, v_2, v_3, v_4, v_5, v_6$ with the edge $(v_6, v_3)$ yields a new path
$v_1, v_2, v_3, v_6, v_5, v_4$.

We first consider a simple, natural algorithm that proves challenging to analyze. We
assume that our input is presented as a list of adjacent edges for each vertex in the graph,
with the edges of each list being given in a random order according to independent and
uniform random permutations. Initially, the algorithm chooses an arbitrary vertex to
start the path; this is the initial head of the path. The head is always one of the endpoints
of the path. From this point on, the algorithm either “grows” the path deterministically
from the head, or rotates the path - as long as there is an adjacent edge remaining on
the head’s list. See Algorithm 5.1.

The difficulty in analyzing this algorithm is that, once the algorithm views some
edges in the edge lists, the distribution of the remaining edges is conditioned on the
edges the algorithm has already seen. We circumvent this difficulty by considering a
modified algorithm that, though less efficient, avoids this conditioning issue and so is
easier to analyze for the random graphs we consider. See Algorithm 5.2. Each vertex
v keeps two lists. The list used-edges(v) contains edges adjacent to v that have been
used in the course of the algorithm while v was the head; initially this list is empty.
The list unused-edges(v) contains other edges adjacent to v that have not been used.

We initially analyze the algorithm assuming a specific model for the initial
unused-edges lists. We subsequently relate this model to the $G_{n,p}$ model for random graphs.
Assume that each of the n - 1 possible edges connected to a vertex v is initially on the
unused-edges list for vertex v independently with some probability q. We also assume
these edges are in a random order. One way to think of this is that, before beginning the
algorithm, we create the unused-edges list for each vertex v by inserting each possible
edge (v, u) with probability q; we think of the corresponding graph G as being the graph
including all edges that were inserted on some unused-edges list. Notice that this means
an edge (v, u) could initially be on the unused-edges list for v but not for u. Also, when
an edge (v, u) is first used in the algorithm, if v is the head then it is removed just from the
unused-edges list of v; if the edge is on the unused-edges list for u, it remains on this list.

By choosing the rotation edge from either the used-edges list or the unused-edges
list with appropriate probabilities and then reversing the path with some small probability
in each step, we modify the rotation process so that the next head of the list is
chosen uniformly at random from among all vertices of the graph. Once we establish
this property, the progress of the algorithm can be analyzed through a straightforward
application of our analysis of the coupon collector’s problem.

The modified algorithm appears wasteful; reversing the path or rotating with one of 
the used edges cannot increase the path length. Also, we may not be taking advantage 
of all the possible edges of G at each step. The advantage of the modified algorithm is 
that it proves easier to analyze, owing to the following lemma. 

Lemma 5.15: Suppose the modified Hamiltonian cycle algorithm is run on a graph
chosen using the described model. Let $V_t$ be the head vertex after the tth step. Then,
for any vertex u, as long as at the tth step there is at least one unused edge available
at the head vertex,

Hamiltonian Cycle Algorithm:

Input: A graph G = (V, E) with n vertices.

Output: A Hamiltonian cycle, or failure.

1. Start with a random vertex as the head of the path.

2. Repeat the following steps until the rotation edge closes a Hamiltonian cycle
   or the unused-edges list of the head of the path is empty:

   (a) Let the current path be $P = v_1, v_2, \ldots, v_k$, where $v_k$ is the head, and let
       $(v_k, u)$ be the first edge in the head’s list.

   (b) Remove $(v_k, u)$ from the head’s list and u’s list.

   (c) If $u \ne v_i$ for $1 \le i \le k$, add $u = v_{k+1}$ to the end of the path and make it
       the head.

   (d) Otherwise, if $u = v_i$, rotate the current path with $(v_k, v_i)$ and set $v_{i+1}$
       to be the head. (This step closes the Hamiltonian path if k = n and the
       chosen edge is $(v_n, v_1)$.)

3. Return a Hamiltonian cycle if one was found or failure if no cycle was found.


Algorithm 5.1: Hamiltonian cycle algorithm.


Modified Hamiltonian Cycle Algorithm: 

Input: A graph G — (V, E) with n vertices and associated edge lists. 

Output: A Hamiltonian cycle, or failure. 

1. Start with a random vertex as the head of the path. 

2. Repeat the following steps until the rotation edge closes a Hamiltonian cycle 
or the unused-edges list of the head of the path is empty: 

(a) Let the current path be P = i \. r ： .q-. with 1 ** being the head. 

(b) Execute i. ii. or iii with probabilities \/n. |used-edges(i , i t)|// 7 . and 
1 — 1//7 — |used-edges( 17 ) j //?. respectively: 

i. Reverse the path, and make f| the head. 

ii. Choose uniformly at random an edge from used-edges(i ，々）； if the 
edge is (v/；, f, ), rotate the current path with (u*-, u, ) and set i; f+ i to be 
the head. (If the edge is ( 17 . then no change is made.) 

iii. Select the first edge from u 11 used-edges (1 ， 人）， call it (17,1/). If u r, 
for 1 < / < k, add u = i，n to the end of the path and make it the 
head. Otherwise, if u = r,. rotate the current path with ( 1 ，人 _ ， i',) and 
set Vj+i to be the head. (This step doses the Hamiltonian path if 

k = n and the chosen edge is (i' M , 

(c) Update the used-edges and unused-edges lists appropriately. 

3. Return a Hamiltonian cycle if one was found or failure if no cycle was found. 
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Pr( K+i = u \ Vf = w；, Vf — i = Wf—i, ..., V{) = Mo) = \/n. 

That is, the head vertex can be thought of as being chosen uniformly at random from 
all vertices at each step, regardless of the history o f the process. 


Proof: Consider the possible cases when the path is $P = v_1, v_2, \ldots, v_k$.

The only way $v_1$ can become the head is if the path is reversed, so $V_{t+1} = v_1$ with
probability $1/n$.

If $u = v_{i+1}$ is a vertex that lies on the path and $(v_k, v_i)$ is in used-edges($v_k$), then
the probability that $V_{t+1} = u$ is

$$\frac{|\text{used-edges}(v_k)|}{n} \cdot \frac{1}{|\text{used-edges}(v_k)|} = \frac{1}{n}.$$

If $u$ is not covered by one of the first two cases then we use the fact that, when
an edge is chosen from unused-edges($v_k$), the adjacent vertex is uniform over all the
$n - |\text{used-edges}(v_k)| - 1$ remaining vertices. This follows from the principle of
deferred decisions. Our initial setup required the unused-edges list for $v_k$ to be
constructed by including each possible edge with probability $q$ and randomizing the order
of the list. This is equivalent to choosing $X$ neighboring vertices for $v_k$, where $X$ is a
$B(n - 1, q)$ random variable and the $X$ vertices are chosen uniformly at random
without replacement. Because $v_k$'s list was determined independently from the lists of the
other vertices, the history of the algorithm tells us nothing about the remaining edges
in unused-edges($v_k$), and the principle of deferred decisions applies. Hence any edge
in $v_k$'s unused-edges list that we have not seen is by construction equally likely to
connect to any of the $n - |\text{used-edges}(v_k)| - 1$ remaining possible neighboring vertices.

If $u = v_{i+1}$ is a vertex on the path but $(v_k, v_i)$ is not in used-edges($v_k$), then
the probability that $V_{t+1} = u$ is the probability that the edge $(v_k, v_i)$ is chosen from
unused-edges($v_k$) as the next rotation edge, which is

$$\left(1 - \frac{1}{n} - \frac{|\text{used-edges}(v_k)|}{n}\right) \cdot \frac{1}{n - |\text{used-edges}(v_k)| - 1} = \frac{1}{n}. \qquad (5.9)$$

Finally, if $u$ is not on the path, then the probability that $V_{t+1} = u$ is the probability
that the edge $(v_k, u)$ is chosen from unused-edges($v_k$). But this has the same
probability as in Eqn. (5.9). ■


For Algorithm 5.2, the problem of finding a Hamiltonian path looks exactly like the
coupon collector's problem; the probability of finding a new vertex to add to the path
when there are $k$ vertices left to be added is $k/n$. Once all the vertices are on the
path, the probability that a cycle is closed in each rotation is $1/n$. Hence, if no list
of unused-edges is exhausted then we can expect a Hamiltonian path to be formed in
about $O(n \ln n)$ rotations, with about another $O(n \ln n)$ rotations to close the path to
form a Hamiltonian cycle. More concretely, we can prove the following theorem.


Theorem 5.16: Suppose the input to the modified Hamiltonian cycle algorithm initially
has unused-edge lists where each edge $(v, u)$ with $u \ne v$ is placed on $v$'s list
independently with probability $q \ge 20 \ln n / n$. Then the algorithm successfully finds a
Hamiltonian cycle in $O(n \ln n)$ iterations of the repeat loop (step 2) with probability $1 - O(n^{-1})$.
Note that we did not assume that the input random graph has a Hamiltonian cycle. A
corollary of the theorem is that, with high probability, a random graph chosen in this 
way has a Hamiltonian cycle. 

Proof of Theorem 5.16: Consider the following two events.

$\mathcal{E}_1$: The algorithm ran for $3n \ln n$ steps with no unused-edges list becoming empty, but
it failed to construct a Hamiltonian cycle.

$\mathcal{E}_2$: At least one unused-edges list became empty during the first $3n \ln n$ iterations of
the loop.

For the algorithm to fail, either event $\mathcal{E}_1$ or $\mathcal{E}_2$ must occur. We first bound the
probability of $\mathcal{E}_1$. Lemma 5.15 implies that, as long as there is no empty unused-edges list in
the first $3n \ln n$ iterations of step 2 of Algorithm 5.2, in each iteration the next head of
the path is uniform among the $n$ vertices of the graph. To bound $\Pr(\mathcal{E}_1)$, we therefore
consider the probability that more than $3n \ln n$ iterations are required to find a Hamiltonian
cycle when the head is chosen uniformly at random each iteration.

The probability that the algorithm takes more than $2n \ln n$ iterations to find a
Hamiltonian path is exactly the probability that a coupon collector's problem on $n$ types
requires more than $2n \ln n$ coupons. The probability that any specific coupon type has
not been found among $2n \ln n$ random coupons is

$$\left(1 - \frac{1}{n}\right)^{2n \ln n} \le e^{-2 \ln n} = \frac{1}{n^2}.$$

By the union bound, the probability that any coupon type is not found is at most $1/n$.

In order to complete a Hamiltonian path to a cycle the path must close, which it does
at each step with probability $1/n$. Hence the probability that the path does not become
a cycle within the next $n \ln n$ iterations is

$$\left(1 - \frac{1}{n}\right)^{n \ln n} \le e^{-\ln n} = \frac{1}{n}.$$

Thus we have shown that

$$\Pr(\mathcal{E}_1) \le \frac{2}{n}.$$
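The two estimates behind this bound are easy to check numerically. The following Python sketch (the function name `path_and_cycle_bounds` is ours, chosen only for illustration) evaluates the probability that a fixed coupon type is missed in $2n \ln n$ draws and the probability that the path fails to close in $n \ln n$ further steps, so that they can be compared with $1/n^2$ and $1/n$.

```python
import math


def path_and_cycle_bounds(n):
    """Return the two failure probabilities estimated in the proof, for a given n."""
    # Probability that one fixed coupon type is missed in 2n ln n uniform draws.
    miss_one = (1 - 1 / n) ** (2 * n * math.log(n))
    # Probability that the path does not close in the next n ln n steps,
    # when each step closes it with probability 1/n.
    no_close = (1 - 1 / n) ** (n * math.log(n))
    return miss_one, no_close


for n in (10, 100, 1000):
    miss_one, no_close = path_and_cycle_bounds(n)
    # Union bound over the n coupon types, plus the closing step, against 2/n.
    print(n, n * miss_one + no_close, 2 / n)
```

For each value of $n$ the printed total should stay below $2/n$, in line with the bound on the first event.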

Next we bound $\Pr(\mathcal{E}_2)$, the probability that an unused-edges list is empty in the first
$3n \ln n$ iterations. We consider two subevents as follows.

$\mathcal{E}_{2a}$: At least $9 \ln n$ edges were removed from the unused-edges list of at least one
vertex in the first $3n \ln n$ iterations of the loop.

$\mathcal{E}_{2b}$: At least one vertex had fewer than $10 \ln n$ edges initially in its unused-edges list.

For $\mathcal{E}_2$ to occur, either $\mathcal{E}_{2a}$ or $\mathcal{E}_{2b}$ must occur. Hence

$$\Pr(\mathcal{E}_2) \le \Pr(\mathcal{E}_{2a}) + \Pr(\mathcal{E}_{2b}).$$

Let us first bound $\Pr(\mathcal{E}_{2a})$. Exactly one edge is used in each iteration of the loop.
From the proof of Lemma 5.15 we have that, at each iteration, the probability that a
given vertex $v$ is the head of the path is $1/n$, independently at each step. Hence the
number of times $X$ that $v$ is the head during the first $3n \ln n$ steps is a binomial
random variable $B(3n \ln n, 1/n)$, and this dominates the number of edges taken from $v$'s
unused-edges list.

Using the Chernoff bound of Eqn. (4.1) with $\delta = 2$ and $\mu = 3 \ln n$ for the binomial
random variable $B(3n \ln n, 1/n)$, we have

$$\Pr(X \ge 9 \ln n) \le \left(\frac{e^2}{27}\right)^{3 \ln n} \le \frac{1}{n^2}.$$

By taking a union bound over all vertices, we find $\Pr(\mathcal{E}_{2a}) \le 1/n$.

Next we bound $\Pr(\mathcal{E}_{2b})$. The expected number of edges $Y$ initially in a vertex's
unused-edges list is at least $(n - 1)q \ge 20(n - 1) \ln n / n \ge 19 \ln n$ for sufficiently large
$n$. Using Chernoff bounds again (Eqn. (4.5)), the probability that a given vertex initially
has $10 \ln n$ edges or fewer on its list is at most

$$\Pr(Y \le 10 \ln n) \le e^{-19 \ln n \,(9/19)^2 / 2} \le \frac{1}{n^2},$$

and by the union bound the probability that any vertex has too few adjacent edges is at
most $1/n$. Thus,

$$\Pr(\mathcal{E}_{2b}) \le \frac{1}{n}$$

and hence

$$\Pr(\mathcal{E}_2) \le \frac{2}{n}.$$

In total, the probability that the algorithm fails to find a Hamiltonian cycle in $3n \ln n$
iterations is bounded by

$$\Pr(\mathcal{E}_1) + \Pr(\mathcal{E}_2) \le \frac{4}{n}. \qquad ■$$
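The two Chernoff estimates used for the edge lists can also be checked numerically. The sketch below assumes that Eqn. (4.1) is the upper-tail bound $\Pr(X \ge (1+\delta)\mu) \le (e^{\delta}/(1+\delta)^{1+\delta})^{\mu}$ and that Eqn. (4.5) is the lower-tail bound $\Pr(X \le (1-\delta)\mu) \le e^{-\mu\delta^2/2}$; the function name `edge_list_bounds` is ours.

```python
import math


def edge_list_bounds(n):
    """Evaluate the two Chernoff bounds from the proof for a given n."""
    # Upper tail: a vertex is the head at least 9 ln n times (mu = 3 ln n, delta = 2).
    mu_x, delta_x = 3 * math.log(n), 2
    upper_tail = (math.exp(delta_x) / (1 + delta_x) ** (1 + delta_x)) ** mu_x
    # Lower tail: a list starts with at most 10 ln n edges (mu = 19 ln n, delta = 9/19).
    mu_y, delta_y = 19 * math.log(n), 9 / 19
    lower_tail = math.exp(-mu_y * delta_y ** 2 / 2)
    return upper_tail, lower_tail


for n in (10, 100, 10_000):
    upper_tail, lower_tail = edge_list_bounds(n)
    # After a union bound over n vertices, each term should be at most 1/n.
    print(n, n * upper_tail, n * lower_tail, 1 / n)
```

The lower-tail bound equals $n^{-81/38}$, which leaves little slack against $1/n^2$; this is one place where the constants 19 and 10 matter.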

We did not make an effort to optimize the constants in the proof. There is, however, a 
clear trade-off; with more edges, one could achieve a lower probability of failure. 
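To make the modified algorithm concrete, the following Python sketch simulates Algorithm 5.2 on edge lists drawn from the model of Theorem 5.16. It is an illustration under our own assumptions, not a reference implementation: the names `random_edge_lists` and `modified_hamiltonian_cycle` are ours, vertices are numbered from 0, we assume $n \ge 3$, step (c) is read as moving an edge from the head's unused-edges list to its used-edges list when step iii draws it, and a closing edge drawn in step ii is also accepted as closing the cycle.

```python
import math
import random


def random_edge_lists(n, q, rng):
    """Each u != v joins v's unused-edges list independently with probability q."""
    lists = []
    for v in range(n):
        nbrs = [u for u in range(n) if u != v and rng.random() < q]
        rng.shuffle(nbrs)  # the model assumes the list is in random order
        lists.append(nbrs)
    return lists


def modified_hamiltonian_cycle(edge_lists, max_iters, rng):
    """Return the vertices of a Hamiltonian cycle in order, or None on failure."""
    n = len(edge_lists)
    unused = [list(lst) for lst in edge_lists]  # work on copies
    used = [[] for _ in range(n)]
    path = [rng.randrange(n)]
    on_path = {path[0]}
    for _ in range(max_iters):
        head = path[-1]
        if not unused[head]:
            return None  # the head's unused-edges list is empty
        r = rng.random()
        if r < 1 / n:
            path.reverse()  # step i: v_1 becomes the head
            continue
        if r < (1 + len(used[head])) / n:
            u = rng.choice(used[head])  # step ii: reuse a used edge
        else:
            u = unused[head].pop()  # step iii: take the next unused edge
            used[head].append(u)
        if u not in on_path:
            path.append(u)  # extend the path; u is the new head
            on_path.add(u)
            continue
        i = path.index(u)
        if len(path) == n and i == 0:
            return path  # the edge (v_n, v_1) closes the cycle
        path[i + 1:] = path[i + 1:][::-1]  # rotate; v_{i+1} becomes the head
    return None


rng = random.Random(0)
n = 200
lists = random_edge_lists(n, 20 * math.log(n) / n, rng)
cycle = modified_hamiltonian_cycle(lists, int(3 * n * math.log(n)), rng)
print("found a cycle" if cycle is not None else "failed")
```

Looking up a vertex with `path.index` costs linear time per step; that is acceptable for a small experiment but would be replaced by a position table in a more careful implementation.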

We are left with showing how our algorithm can be applied to graphs in $G_{n,p}$. We
show that, as long as $p$ is known, we can partition the edges of the graph into edge lists
that satisfy the requirements of Theorem 5.16.


Corollary 5.17: By initializing edges on the unused-edges lists appropriately,
Algorithm 5.2 will find a Hamiltonian cycle on a graph chosen randomly from $G_{n,p}$ with
probability $1 - O(1/n)$ whenever $p \ge 40 \ln n / n$.

Proof: We partition the edges of our input graph from $G_{n,p}$ as follows. Let $q \in [0, 1]$ be
such that $p = 2q - q^2$. Consider any edge $(u, v)$ in the input graph. We execute exactly
one of the following three possibilities: with probability $q(1 - q)/(2q - q^2)$ we place
the edge on $u$'s unused-edges list but not on $v$'s; with probability $q(1 - q)/(2q - q^2)$ we
initially place the edge on $v$'s unused-edges list but not on $u$'s; and with the remaining
probability $q^2/(2q - q^2)$ the edge is placed on both unused-edges lists.




Now, for any possible edge $(u, v)$, the probability that it is initially placed in the
unused-edges list for $v$ is

$$p \left( \frac{q(1 - q)}{2q - q^2} + \frac{q^2}{2q - q^2} \right) = q.$$

Moreover, the probability that an edge $(u, v)$ is initially placed on the unused-edges
list for both $u$ and $v$ is $p q^2/(2q - q^2) = q^2$, so these two placements are independent
events. Since each edge $(u, v)$ is treated independently, this partitioning fulfills the
requirements of Theorem 5.16 provided the resulting $q$ is at least $20 \ln n / n$. When $p \ge
40 \ln n / n$ we have $q \ge p/2 \ge 20 \ln n / n$, and the result follows. ■
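The partitioning step in this proof can be sketched in a few lines of Python. The function name `partition_edges` and the representation of the graph as a list of pairs are our own choices; the sketch assumes $0 < p \le 1$ and uses the root $q = 1 - \sqrt{1 - p}$ of $p = 2q - q^2$ that lies in $[0, 1]$.

```python
import math
import random


def partition_edges(n, edges, p, rng):
    """Distribute the edges of a G(n, p) sample onto unused-edges lists."""
    q = 1 - math.sqrt(1 - p)  # solves p = 2q - q^2 with q in [0, 1]
    one_sided = q * (1 - q) / p  # q(1 - q) / (2q - q^2)
    lists = [[] for _ in range(n)]
    for u, v in edges:
        r = rng.random()
        if r < one_sided:
            lists[u].append(v)  # on u's list only
        elif r < 2 * one_sided:
            lists[v].append(u)  # on v's list only
        else:
            lists[u].append(v)  # remaining probability q^2 / p: on both lists
            lists[v].append(u)
    for lst in lists:
        rng.shuffle(lst)  # the lists must be in random order
    return q, lists


rng = random.Random(1)
n, p = 40, 0.5
edges = [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]
q, lists = partition_edges(n, edges, p, rng)
print(q, sum(len(lst) for lst in lists), len(edges))
```

Every edge of the input graph ends up on at least one list, so the graph defined by the lists is the input graph itself.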


In Exercise 5.26, we consider how to use Algorithm 5.2 even in the case where p is not 
known in advance, so that the edge lists must be initialized without knowledge of p. 


5.7. Exercises 


Exercise 5.1: For what values of $n$ is $(1 + 1/n)^n$ within 1% of $e$? Within 0.0001% of $e$?
Similarly, for what values of $n$ is $(1 - 1/n)^n$ within 1% of $1/e$? Within 0.0001%?

Exercise 5.2: Suppose that Social Security numbers were issued uniformly at random, 
with replacement. That is, your Social Security number would consist of just nine
randomly generated digits, and no check would be made to ensure that the same number
was not issued twice. Sometimes, the last four digits of a Social Security number are 
used as a password. How many people would you need to have in a room before it was 
more likely than not that two had the same last four digits? How many numbers could 
be issued before it would be more likely than not that there is a duplicate number? How 
would you answer these two questions if Social Security numbers had 13 digits? Try 
to give exact numerical answers. 


Exercise 5.3: Suppose that balls are thrown randomly into $n$ bins. Show, for some
constant $c_1$, that if there are $c_1\sqrt{n}$ balls then the probability that no two land in the same
bin is at most $1/e$. Similarly, show for some constant $c_2$ (and sufficiently large $n$) that,
if there are $c_2\sqrt{n}$ balls, then the probability that no two land in the same bin is at least
1/2. Make these constants as close to optimal as possible. Hint: You may want to use
the facts that

$$e^{-x} \ge 1 - x$$


and

$$e^{-x - x^2} \le 1 - x \quad \text{for } x \le \tfrac{1}{2}.$$


Exercise 5.4: In a lecture hall containing 100 people, you consider whether or not there 
are three people in the room who share the same birthday. Explain how to calculate 
this probability exactly, using the same assumptions as in our previous analysis. 

Exercise 5.5: Let $X$ be a Poisson random variable with mean $\mu$, representing the
number of errors on a page of this book. Each error is independently a grammatical error
with probability $p$ and a spelling error with probability $1 - p$. If $Y$ and $Z$ are random
variables representing the number of grammatical and spelling errors (respectively) on
a page of this book, prove that $Y$ and $Z$ are Poisson random variables with means $\mu p$
and $\mu(1 - p)$, respectively. Also, prove that $Y$ and $Z$ are independent.


Exercise 5.6: Use the Taylor expansion

$$\ln(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots$$

to prove that, for any $x$ with $|x| \le 1$,

$$e^x (1 - x^2) \le 1 + x \le e^x.$$


Exercise 5.7: Suppose that n balls are thrown independently and uniformly at random 
into n bins. 

(a) Find the conditional probability that bin 1 has one ball given that exactly one ball 
fell into the first three bins. 

(b) Find the conditional expectation of the number of balls in bin 1 under the condition 
that bin 2 received no balls. 

(c) Write an expression for the probability that bin 1 receives more balls than bin 2. 

Exercise 5.8: Our analysis of Bucket sort in Section 5.2.2 assumed that $n$ elements
were chosen independently and uniformly at random from the range $[0, 2^k)$. Suppose
instead that $n$ elements are chosen independently from the range $[0, 2^k)$ according to a
distribution with the property that any number $x \in [0, 2^k)$ is chosen with probability at
most $a/2^k$ for some fixed constant $a > 0$. Show that, under these conditions, Bucket
sort still requires linear expected time.

Exercise 5.9: Consider the probability that every bin receives exactly one ball when 
n balls are thrown randomly into n bins. 

(a) Give an upper bound on this probability using the Poisson approximation. 

(b) Determine the exact probability of this event. 

(c) Show that these two probabilities differ by a multiplicative factor that equals the 
probability that a Poisson random variable with parameter n takes on the value n. 
Explain why this is implied by Theorem 5.6. 

Exercise 5.10: Consider throwing $m$ balls into $n$ bins, and for convenience let the
bins be numbered from 0 to $n - 1$. We say there is a $k$-gap starting at bin $i$ if bins
$i, i + 1, \ldots, i + k - 1$ are all empty.

(a) Determine the expected number of $k$-gaps.

(b) Prove a Chernoff-like bound for the number of $k$-gaps. (Hint: If you let $X_i = 1$
when there is a $k$-gap starting at bin $i$, then there are dependencies between $X_i$ and
$X_{i+1}$; to avoid these dependencies, you might consider $X_i$ and $X_{i+k}$.)

Exercise 5.11: The following problem models a simple distributed system wherein
agents contend for resources but "back off" in the face of contention. Balls represent
agents, and bins represent resources.

The system evolves over rounds. Every round, balls are thrown independently and
uniformly at random into $n$ bins. Any ball that lands in a bin by itself is served and
removed from consideration. The remaining balls are thrown again in the next round.
We begin with $n$ balls in the first round, and we finish when every ball is served.

(a) If there are $b$ balls at the start of a round, what is the expected number of balls at
the start of the next round?

(b) Suppose that every round the number of balls served was exactly the expected
number of balls to be served. Show that all the balls would be served in $O(\log \log n)$
rounds. (Hint: If $x_j$ is the expected number of balls left after $j$ rounds, show and
use that $x_{j+1} \le x_j^2/n$.)

Exercise 5.12: Suppose that we vary the balls-and-bins process as follows. For
convenience let the bins be numbered from 0 to $n - 1$. There are $\log_2 n$ players. Each player
randomly chooses a starting location $\ell$ uniformly from $[0, n - 1]$ and then places one
ball in each of the bins numbered $\ell \bmod n$, $\ell + 1 \bmod n$, $\ldots$, $\ell + n/\log_2 n - 1 \bmod n$.
Argue that the maximum load in this case is only $O(\log \log n / \log \log \log n)$ with
probability that approaches 1 as $n \to \infty$.

Exercise 5.13: We prove that if $Z$ is a Poisson random variable of mean $\mu$, where $\mu \ge
1$ is an integer, then $\Pr(Z \ge \mu) \ge 1/2$ and $\Pr(Z \le \mu) \ge 1/2$.

(a) Show that $\Pr(Z = \mu + h) \ge \Pr(Z = \mu - h - 1)$ for $0 \le h \le \mu - 1$.

(b) Using part (a), argue that $\Pr(Z \ge \mu) \ge 1/2$.

(c) Show that $\Pr(Z = \mu - h) \ge \Pr(Z = \mu + h + 1)$ for $0 \le h \le \mu$.

(d) Determine a lower bound on $\Pr(Z = \mu - h) - \Pr(Z = \mu + h + 1)$.

(e) Determine an upper bound on $\Pr(Z \ge 2\mu + 2)$.

(f) Using parts (c)-(e), argue that $\Pr(Z \le \mu) \ge 1/2$.

Exercise 5.14: (a) In Theorem 5.7 we showed that, for any nonnegative functions $f$,

$$\mathbf{E}[f(Y_1^{(m)}, \ldots, Y_n^{(m)})] \ge \mathbf{E}[f(X_1^{(m)}, \ldots, X_n^{(m)})] \Pr\left(\sum Y_i^{(m)} = m\right).$$

Prove that if $\mathbf{E}[f(X_1^{(m)}, \ldots, X_n^{(m)})]$ is monotonically increasing in $m$, then

$$\mathbf{E}[f(Y_1^{(m)}, \ldots, Y_n^{(m)})] \ge \mathbf{E}[f(X_1^{(m)}, \ldots, X_n^{(m)})] \Pr\left(\sum Y_i^{(m)} \ge m\right),$$

again under the condition that $f$ is nonnegative. Make a similar statement for the case
when $\mathbf{E}[f(X_1^{(m)}, \ldots, X_n^{(m)})]$ is monotonically decreasing in $m$.

(b) Using part (a) and Exercise 5.13, prove Theorem 5.10.


Exercise 5.15: We consider another way to obtain Chernoff-like bounds in the setting
of balls and bins without using Theorem 5.7. Consider $n$ balls thrown randomly into
$n$ bins. Let $X_i = 1$ if the $i$th bin is empty and 0 otherwise. Let $X = \sum_{i=1}^{n} X_i$. Let
$Y_i$, $i = 1, \ldots, n$, be independent Bernoulli random variables that are 1 with probability
$p = (1 - 1/n)^n$. Let $Y = \sum_{i=1}^{n} Y_i$.

(a) Show that $\mathbf{E}[X_1 X_2 \cdots X_k] \le \mathbf{E}[Y_1 Y_2 \cdots Y_k]$ for any $k \ge 1$.

(b) Show that $\mathbf{E}[e^{tX}] \le \mathbf{E}[e^{tY}]$ for all $t \ge 0$. (Hint: Use the expansion for $e^x$ and
compare $\mathbf{E}[X^k]$ to $\mathbf{E}[Y^k]$.)

(c) Derive a Chernoff bound for $\Pr(X \ge (1 + \delta)\mathbf{E}[X])$.


Exercise 5.16: Let $G$ be a random graph generated using the $G_{n,p}$ model.

(a) A clique of $k$ vertices in a graph is a subset of $k$ vertices such that all $\binom{k}{2}$ edges
between these vertices lie in the graph. For what value of $p$, as a function of $n$, is the
expected number of cliques of five vertices in $G$ equal to 1?

(b) A $K_{3,3}$ graph is a complete bipartite graph with three vertices on each side. In
other words, it is a graph with six vertices and nine edges; the six distinct vertices
are arranged in two groups of three, and the nine edges connect each of the nine
pairs of vertices with one vertex in each group. For what value of $p$, as a function
of $n$, is the expected number of $K_{3,3}$ subgraphs of $G$ equal to 1?

(c) For what value of $p$, as a function of $n$, is the expected number of Hamiltonian
cycles in the graph equal to 1?


Exercise 5.17: Theorem 5.7 shows that any event that occurs with small probability in
the balls-and-bins setting where the number of balls in each bin is an independent
Poisson random variable also occurs with small probability in the standard balls-and-bins
model. Prove a similar statement for random graphs: Every event that happens with
small probability in the $G_{n,p}$ model also happens with small probability in the $G_{n,N}$
model for $N = \binom{n}{2} p$.

Exercise 5.18: An undirected graph on $n$ vertices is disconnected if there exists a set
of $k < n$ vertices such that there is no edge between this set and the rest of the graph.
Otherwise, the graph is said to be connected. Show that there exists a constant $c$ such
that if $N \ge cn \log n$ then, with probability $O(e^{-n})$, a graph randomly chosen from
$G_{n,N}$ is connected.

Exercise 5.19: Prove Theorem 5.14. 


Exercise 5.20: (a) Let $f(n)$ be the expected number of random edges that must be
added before an empty undirected graph with $n$ vertices becomes connected.
(Connectedness is defined in Exercise 5.18.) That is, suppose that we start with a graph on
$n$ vertices with zero edges and then repeatedly add an edge, chosen uniformly at
random from all edges not currently in the graph, until the graph becomes connected. If
$X_n$ represents the number of edges added, then $f(n) = \mathbf{E}[X_n]$.

Write a program to estimate $f(n)$ for a given value of $n$. Your program should track
the connected components of the graph as you add edges until the graph becomes
connected. You will probably want to use a disjoint set data structure, a topic covered in
standard undergraduate algorithms texts. You should try $n$ = 100, 200, 300, 400, 500,
600, 700, 800, 900, and 1000. Repeat each experiment 100 times, and for each value of
$n$ compute the average number of edges needed. Based on your experiments, suggest
a function $h(n)$ that you think is a good estimate for $f(n)$.

(b) Modify your program for the problem in part (a) so that it also keeps track of
isolated vertices. Let $g(n)$ be the expected number of edges added before there are no
more isolated vertices. What seems to be the relationship between $f(n)$ and $g(n)$?
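As one possible starting point for part (a), the following Python sketch runs a single trial of the edge-adding process with a simple disjoint set structure. The function name `edges_until_connected` is ours, and drawing a uniform pair and rejecting pairs already present is just one way to sample uniformly from the edges not yet in the graph; part (b) and the averaging over trials are left as in the exercise.

```python
import random


def edges_until_connected(n, rng):
    """Add uniformly random new edges until the graph is connected; return the count."""
    parent = list(range(n))

    def find(x):
        # Find the component representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    present = set()
    components = n
    added = 0
    while components > 1:
        u, v = rng.sample(range(n), 2)
        edge = (min(u, v), max(u, v))
        if edge in present:
            continue  # reject: this edge is already in the graph
        present.add(edge)
        added += 1
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components
            components -= 1
    return added


rng = random.Random(0)
print(edges_until_connected(100, rng))
```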

Exercise 5.21: In hashing with open addressing, the hash table is implemented as an
array and there are no linked lists or chaining. Each entry in the array either contains
one hashed item or is empty. The hash function defines, for each key $k$, a probe
sequence $h(k, 0), h(k, 1), \ldots$ of table locations. To insert the key $k$, we first examine the
sequence of table locations in the order defined by the key's probe sequence until we
find an empty location; then we insert the item at that position. When searching for
an item in the hash table, we examine the sequence of table locations in the order
defined by the key's probe sequence until either the item is found or we have found an
empty location in the sequence. If an empty location is found, this means the item is
not present in the table.

An open-address hash table with $2n$ entries is used to store $n$ items. Assume that
the table location $h(k, j)$ is uniform over the $2n$ possible table locations and that all
$h(k, j)$ are independent.

(a) Show that, under these conditions, the probability of an insertion requiring more
than $k$ probes is at most $2^{-k}$.

(b) Show that, for $i = 1, 2, \ldots, n$, the probability that the $i$th insertion requires more
than $2 \log n$ probes is at most $1/n^2$.

Let the random variable $X_i$ denote the number of probes required by the $i$th insertion.
You have shown in part (b) that $\Pr(X_i > 2 \log n) \le 1/n^2$. Let the random variable $X =
\max_{1 \le i \le n} X_i$ denote the maximum number of probes required by any of the $n$ insertions.

(c) Show that $\Pr(X > 2 \log n) \le 1/n$.

(d) Show that the expected length of the longest probe sequence is $\mathbf{E}[X] = O(\log n)$.

Exercise 5.22: Bloom filters can be used to estimate set differences. Suppose you
have a set $X$ and I have a set $Y$, both with $n$ elements. For example, the sets might
represent our 100 favorite songs. We both create Bloom filters of our sets, using the same
number of bits $m$ and the same $k$ hash functions. Determine the expected number of
bits where our Bloom filters differ as a function of $m$, $k$, and $|X \cap Y|$. Explain how
this could be used as a tool to find people with the same taste in music more easily than
comparing lists of songs directly.


Exercise 5.23: Suppose that we wanted to extend Bloom filters to allow deletions as
well as insertions of items into the underlying set. We could modify the Bloom
filter to be an array of counters instead of an array of bits. Each time an item is inserted
into a Bloom filter, the counters given by the hashes of the item are increased by one.
To delete an item, one can simply decrement the counters. To keep space small, the
counters should be a fixed length, such as 4 bits.

Explain how errors can arise when using fixed-length counters. Assuming a setting 
where one has at most n elements in the set at any time, m counters, k hash functions, 
and counters with b bits, explain how to bound the probability that an error occurs over 
the course of t insertions or deletions. 
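To fix ideas, the counter-based filter described in this exercise might be sketched as follows. The class name `CountingBloomFilter`, the use of salted SHA-256 digests as the $k$ hash functions, and the decision to let a full counter saturate are all our own illustrative assumptions; how a full counter should be handled is part of what the exercise asks you to think about.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with m fixed-width counters and k hash functions."""

    def __init__(self, m, k, bits=4):
        self.m = m
        self.k = k
        self.max_count = (1 << bits) - 1  # largest value a b-bit counter can hold
        self.counters = [0] * m

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests (an illustrative choice).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1  # saturate rather than wrap around

    def delete(self, item):
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))


f = CountingBloomFilter(m=1024, k=3)
f.insert("apple")
f.insert("banana")
f.delete("apple")
print("banana" in f)
```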

Exercise 5.24: Suppose that you built a Bloom filter for a dictionary of words with
$m = 2^b$ bits. A co-worker building an application wants to use your Bloom filter but
has only $2^{b-1}$ bits available. Explain how your colleague can use your Bloom filter to
avoid rebuilding a new Bloom filter using the original dictionary of words.


Exercise 5.25: For the leader election problem alluded to in Section 5.5.4, we have n 
users, each with an identifier. The hash function takes as input the identifier and
outputs a $b$-bit hash value, and we assume that these values are independent and uniformly
distributed. Each user hashes its identifier, and the leader is the user with the smallest 
hash value. Give lower and upper bounds on the number of bits b necessary to ensure 
that a unique leader is successfully chosen with probability p. Make your bounds as 
tight as possible. 

Exercise 5.26: Consider Algorithm 5.2, the modified algorithm for finding Hamiltonian
cycles. We have shown that the algorithm can be applied to find a Hamiltonian
cycle with high probability in a graph chosen randomly from $G_{n,p}$, when $p$ is known
and sufficiently large, by initially placing edges in the edge lists appropriately. Argue
that the algorithm can similarly be applied to find a Hamiltonian cycle with high
probability on a graph chosen randomly from $G_{n,N}$ when $N = c_1 n \ln n$ for a suitably large
constant $c_1$. Argue also that the modified algorithm can be applied even when $p$ is not
known in advance as long as $p$ is at least $c_2 \ln n / n$ for a suitably large constant $c_2$.


5.8. An Exploratory Assignment 


Part of the research process in random processes is first to understand what is going on
at a high level and then to use this understanding in order to develop formal
mathematical proofs. In this assignment, you will be given several variations on a basic random
process. To gain insight, you should perform experiments based on writing code to
simulate the processes. (The code should be very short, a few pages at most.) After the
experiments, you should use the results of the simulations to guide you to make
conjectures and prove statements about the processes. You can apply what you have learned
up to this point, including probabilistic bounds and analysis of balls-and-bins problems.

Consider a complete binary tree with $N = 2^n - 1$ nodes. Here $n$ is the depth of the
tree. Initially, all nodes are unmarked. Over time, via processes that we shall describe,
nodes become marked.

All of the processes share the same basic form. We can think of the nodes as having
unique identifying numbers in the range of $[1, N]$. Each unit of time, I send you the
identifier of a node. When you receive a sent node, you mark it. Also, you invoke the
following marking rule, which takes effect before I send out the next node.

Figure 5.3: The arrival of X causes all other nodes to be marked.


• If a node and its sibling are marked, its parent is marked. 

• If a node and its parent are marked, the other sibling is marked. 

The marking rule is applied recursively as much as possible before the next node 
is sent. For example, in Figure 5.3, the marked nodes are filled in. The arrival of the 
node labeled by an X will allow you to mark the remainder of the nodes, as you apply 
the marking rule first up and then down the tree. Keep in mind that you always apply 
the marking rule as much as possible. 
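A minimal Python sketch of the marking rule follows, as one way to start the simulations. It assumes (our choice, not fixed by the text) that nodes are numbered in heap order, so that the root is 1 and node $x$ has children $2x$ and $2x + 1$; the names `send` and `process1` are ours. The two bullets of the rule amount to saying that whenever two nodes of a parent-and-two-children triple are marked, the third becomes marked.

```python
import random


def send(marked, v):
    """Mark node v, apply the marking rule exhaustively, return the number newly marked."""
    size = len(marked) - 1  # marked[0] is unused; nodes are 1..N
    newly = 0
    stack = [v]
    while stack:
        x = stack.pop()
        if marked[x]:
            continue
        marked[x] = True
        newly += 1
        if x > 1:  # x together with its sibling or its parent
            parent, sibling = x // 2, x ^ 1
            if marked[sibling] and not marked[parent]:
                stack.append(parent)
            if marked[parent] and not marked[sibling]:
                stack.append(sibling)
        left, right = 2 * x, 2 * x + 1
        if right <= size:  # x as the parent of a marked child
            if marked[left] and not marked[right]:
                stack.append(right)
            if marked[right] and not marked[left]:
                stack.append(left)
    return newly


def process1(n, rng):
    """Process 1: send uniformly random nodes until the whole tree is marked."""
    size = 2 ** n - 1
    marked = [False] * (size + 1)
    remaining, sent = size, 0
    while remaining > 0:
        remaining -= send(marked, rng.randint(1, size))
        sent += 1
    return sent


print(process1(10, random.Random(0)))
```

Processes 2 and 3 differ only in how the next node is chosen, so the same `send` routine can be reused for them.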

Now let us consider the different ways in which I might be sending you the nodes. 

Process 1: Each unit of time, I send the identifier of a node chosen independently and
uniformly at random from all of the $N$ nodes. Note that I might send you a node that
is already marked, and in fact I may send a useless node that I have already sent.

Process 2: Each unit of time I send the identifier of a node chosen uniformly at
random from those nodes that I have not yet sent. Again, a node that has already been
marked might arrive, but each node will be sent at most once.

Process 3: Each unit of time I send the identifier of a node chosen uniformly at random 
from those nodes that you have not yet marked. 


We want to determine how many time steps are needed before all the nodes are
marked for each of these processes. Begin by writing programs to simulate the
sending processes and the marking rule. Run each process ten times for each value of $n$ in
the range [10, 20]. Present the data from your experiments in a clear, easy-to-read
fashion and explain your data suitably. A tip: You may find it useful to have your program
print out the last node that was sent before the tree became completely marked.

1. For the first process, prove that the expected number of nodes sent is $\Omega(N \log N)$.
How well does this match your simulations?

2. For the second process, you should find that almost all $N$ nodes must be sent before
the tree is marked. Show that, with constant probability, at least $N - 2\sqrt{N}$ nodes
must be sent.

3. The behavior of the third process might seem a bit unusual. Explain it with a proof. 

After answering these questions, you may wish to consider other facts you could prove 
about these processes. 
The Probabilistic Method


The probabilistic method is a way of proving the existence of objects. The
underlying principle is simple: to prove the existence of an object with certain properties, we
demonstrate a sample space of objects in which the probability is positive that a
randomly selected object has the required properties. If the probability of selecting an
object with the required properties is positive, then the sample space must contain such
an object, and therefore such an object exists. For example, if there is a positive
probability of winning a million-dollar prize in a raffle, then there must be at least one raffle
ticket that wins that prize.

Although the basic principle of the probabilistic method is simple, its application to
specific problems often involves sophisticated combinatorial arguments. In this
chapter we study a number of techniques for constructing proofs based on the probabilistic
method, starting with simple counting and averaging arguments and then introducing
two more advanced tools, the Lovász local lemma and the second moment method.

In the context of algorithms we are generally interested in explicit constructions
of objects, not merely in proofs of existence. In many cases the proofs of existence
obtained by the probabilistic method can be converted into efficient randomized
construction algorithms. In some cases, these proofs can be converted into efficient
deterministic construction algorithms; this process is called derandomization, since it
converts a probabilistic argument into a deterministic one. We give examples of both
randomized and deterministic construction algorithms arising from the probabilistic
method.


6.1. The Basic Counting Argument 

To prove the existence of an object with specific properties, we construct an
appropriate probability space $S$ of objects and then show that the probability that an object in
$S$ with the required properties is selected is strictly greater than 0.

For our first example, we consider the problem of coloring the edges of a graph with
two colors so that there are no large cliques with all edges having the same color. Let
$K_n$ be a complete graph (with all $\binom{n}{2}$ edges) on $n$ vertices. A clique of $k$ vertices in $K_n$
is a complete subgraph $K_k$.

Theorem 6.1: If $\binom{n}{k} 2^{-\binom{k}{2}+1} < 1$, then it is possible to color the edges of $K_n$ with two
colors so that it has no monochromatic $K_k$ subgraph.


Proof: Define a sample space consisting of all possible colorings of the edges of $K_n$
using two colors. There are $2^{\binom{n}{2}}$ possible colorings, so if one is chosen uniformly at
random then the probability of choosing each coloring in our probability space is $2^{-\binom{n}{2}}$.
A nice way to think about this probability space is: if we color each edge of the graph 
independently, with each edge taking each of the two colors with probability 1/2, then 
we obtain a random coloring chosen uniformly from this sample space. That is, we flip 
an independent fair coin to determine the color of each edge. 

Fix an arbitrary ordering of all of the $\binom{n}{k}$ different $k$-vertex cliques of $K_n$, and for
$i = 1, \ldots, \binom{n}{k}$ let $A_i$ be the event that clique $i$ is monochromatic. Once the first edge
in clique $i$ is colored, the remaining $\binom{k}{2} - 1$ edges must all be given the same color. It
follows that

$$\Pr(A_i) = 2^{-\binom{k}{2}+1}.$$

Using a union bound then yields

$$\Pr\Bigl(\bigcup_{i=1}^{\binom{n}{k}} A_i\Bigr) \le \sum_{i=1}^{\binom{n}{k}} \Pr(A_i) = \binom{n}{k} 2^{-\binom{k}{2}+1} < 1,$$

where the last inequality follows from the assumptions of the theorem. Hence

$$\Pr\Bigl(\bigcap_{i=1}^{\binom{n}{k}} \bar{A}_i\Bigr) = 1 - \Pr\Bigl(\bigcup_{i=1}^{\binom{n}{k}} A_i\Bigr) > 0.$$

Since the probability of choosing a coloring with no monochromatic $k$-vertex clique
from our sample space is strictly greater than 0, there must exist a coloring with no
monochromatic $k$-vertex clique. ■


As an example, consider whether the edges of $K_{1000}$ can be 2-colored in such a way
that there is no monochromatic $K_{20}$. Our calculations are simplified if we note that, for
$n \le 2^{k/2}$ and $k \ge 3$,

$$\binom{n}{k} 2^{-\binom{k}{2}+1} \le \frac{n^k}{k!}\, 2^{-(k(k-1)/2)+1} \le \frac{2^{k/2+1}}{k!} < 1.$$

Observing that for our example $n = 1000 \le 2^{10} = 2^{k/2}$, we see that by Theorem 6.1
there exists a 2-coloring of the edges of $K_{1000}$ with no monochromatic $K_{20}$.
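
The hypothesis of Theorem 6.1 can also be checked directly with exact integer arithmetic, without the simplifying bound. The following is a minimal sketch; the function name `counting_condition_holds` is an illustrative choice, not something from the text.

```python
from math import comb


def counting_condition_holds(n: int, k: int) -> bool:
    """Return True if C(n, k) * 2^(-C(k, 2) + 1) < 1.

    The inequality is multiplied through by 2^C(k, 2) so that only
    integers are compared and no floating-point rounding is involved.
    """
    return comb(n, k) * 2 < 2 ** comb(k, 2)


if __name__ == "__main__":
    # The example from the text: n = 1000 vertices, no monochromatic K_20.
    print(counting_condition_holds(1000, 20))
```

When the function returns False, the theorem is simply silent: the counting argument gives a sufficient condition, not a necessary one.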

Can we use this proof to design an efficient algorithm to construct such a coloring?
Let us consider a general approach that gives a randomized construction algorithm. 
First, we require that we can efficiently sample a coloring from the sample space. In 
this case sampling is easy, because we can simply color each edge independently with 
a randomly chosen color. In general, however, there might not be an efficient sampling 
algorithm. 

If we have an efficient sampling algorithm, the next question is: How many samples
must we generate before obtaining a sample that satisfies our requirements? If
the probability of obtaining a sample with the desired properties is $p$ and if we sample
independently at each trial, then the number of samples needed before finding a sample
with the required properties is a geometric random variable with expectation $1/p$.
Hence we need that $1/p$ be polynomial in the problem size in order to have an algorithm
that finds a suitable sample in polynomial expected time.

If $p = 1 - o(1)$, then sampling once gives a Monte Carlo construction algorithm
that is incorrect with probability $o(1)$. In our specific example of finding a coloring on
a graph of 1000 vertices with no monochromatic $K_{20}$, we know that the probability that
a random coloring has a monochromatic $K_{20}$ is at most

$$\frac{2^{20/2+1}}{20!} < 8.5 \cdot 10^{-16}.$$


Hence we have a Monte Carlo algorithm with a small probability of failure. 

If we want a Las Vegas algorithm - that is, one that always gives a correct construction
- then we need a third ingredient. We require a polynomial time procedure for
verifying that a sample object satisfies the requirements; then we can test samples until
we find one that does so. An upper bound on the expected time for this construction
can be found by multiplying together the expected number of samples $1/p$, an upper
bound on the time to generate each sample, and an upper bound on the time to check
each sample.¹ For the coloring problem, there is a polynomial time verification algorithm
when $k$ is a constant: simply check all $\binom{n}{k}$ cliques and make sure they are not
monochromatic. It does not seem that this approach can be extended to yield polynomial
time algorithms when $k$ grows with $n$.
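
The three ingredients just described (sample, verify, repeat) can be sketched as follows for small, constant $k$. This is an illustration under stated assumptions rather than a tuned implementation: the function names are hypothetical, vertices are labeled 0 to n-1, the two colors are 0 and 1, and the verifier is the brute-force check over all $\binom{n}{k}$ cliques.

```python
import random
from itertools import combinations


def random_coloring(n, rng):
    """Color each edge of K_n independently with color 0 or 1."""
    return {edge: rng.randrange(2) for edge in combinations(range(n), 2)}


def has_monochromatic_clique(coloring, n, k):
    """Brute-force verifier: inspect every k-vertex clique of K_n."""
    for clique in combinations(range(n), k):
        colors = {coloring[edge] for edge in combinations(clique, 2)}
        if len(colors) == 1:
            return True
    return False


def las_vegas_coloring(n, k, seed=0):
    """Resample until the verifier accepts; the output is always correct."""
    rng = random.Random(seed)
    while True:
        coloring = random_coloring(n, rng)
        if not has_monochromatic_clique(coloring, n, k):
            return coloring


if __name__ == "__main__":
    # For n = 6, k = 4 the bound of Theorem 6.1 is 15/32 < 1, so each
    # sample succeeds with probability at least 17/32.
    result = las_vegas_coloring(6, 4)
    print(len(result), "edges colored")
```

The loop terminates with probability 1 whenever a valid coloring exists, and the expected number of iterations is $1/p$, where $p$ is the success probability of a single sample.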


6.2. The Expectation Argument 


As we have seen, in order to prove that an object with certain properties exists, we can 
design a probability space from which an element chosen at random yields an object 
with the desired properties with positive probability. A similar and sometimes easier 
approach for proving that such an object exists is to use an averaging argument. The 
intuition behind this approach is that, in a discrete probability space, a random variable 
must with positive probability assume at least one value that is no greater than its expectation
and at least one value that is not smaller than its expectation. For example, if
the expected value of a raffle ticket is $3, then there must be at least one ticket
that ends up being worth no more than $3 and at least one that ends up being worth no
less than $3.

¹ Sometimes the time to generate or check a sample may itself be a random variable. In this case, Wald's equation
(discussed in Chapter 12) may apply.

More formally, we have the following lemma. 

Lemma 6.2: Suppose we have a probability space $S$ and a random variable $X$ defined
on $S$ such that $\mathrm{E}[X] = \mu$. Then $\Pr(X \ge \mu) > 0$ and $\Pr(X \le \mu) > 0$.


Proof: We have

$$\mu = \mathrm{E}[X] = \sum_{x} x \Pr(X = x),$$

where the summation ranges over all values in the range of $X$. If $\Pr(X \ge \mu) = 0$, then

$$\mu = \sum_{x} x \Pr(X = x) = \sum_{x < \mu} x \Pr(X = x) < \sum_{x < \mu} \mu \Pr(X = x) = \mu,$$

giving a contradiction. Similarly, if $\Pr(X \le \mu) = 0$ then

$$\mu = \sum_{x} x \Pr(X = x) = \sum_{x > \mu} x \Pr(X = x) > \sum_{x > \mu} \mu \Pr(X = x) = \mu,$$

again yielding a contradiction. ■

Thus, there must be at least one instance in the sample space $S$ for which the value
of $X$ is at least $\mu$ and at least one instance for which the value of $X$ is no greater than $\mu$.


6.2.1. Application: Finding a Large Cut 

We consider the problem of finding a large cut in an undirected graph. A cut is a partition
of the vertices into two disjoint sets, and the value of a cut is the weight of all
edges crossing from one side of the partition to the other. Here we consider the case
where all edges in the graph have the same weight 1. The problem of finding a maximum
cut is NP-hard. Using the probabilistic method, we show that the value of the
maximum cut must be at least 1/2 the number of edges in the graph.

Theorem 6.3: Given an undirected graph $G$ with $n$ vertices and $m$ edges, there is a
partition of $V$ into two disjoint sets $A$ and $B$ such that at least $m/2$ edges connect a
vertex in $A$ to a vertex in $B$. That is, there is a cut with value at least $m/2$.


Proof: Construct sets $A$ and $B$ by randomly and independently assigning each vertex
to one of the two sets. Let $e_1, \ldots, e_m$ be an arbitrary enumeration of the edges of $G$.
For $i = 1, \ldots, m$, define $X_i$ such that

$$X_i = \begin{cases} 1 & \text{if edge } i \text{ connects } A \text{ to } B, \\ 0 & \text{otherwise.} \end{cases}$$

The probability that edge $e_i$ connects a vertex in $A$ to a vertex in $B$ is 1/2, and thus

$$\mathrm{E}[X_i] = \frac{1}{2}.$$

Let $C(A, B)$ be a random variable denoting the value of the cut corresponding to the
sets $A$ and $B$. Then

$$\mathrm{E}[C(A, B)] = \mathrm{E}\Bigl[\sum_{i=1}^{m} X_i\Bigr] = \sum_{i=1}^{m} \mathrm{E}[X_i] = m \cdot \frac{1}{2} = \frac{m}{2}.$$

Since the expectation of the random variable $C(A, B)$ is $m/2$, there exists a partition
$A$ and $B$ with at least $m/2$ edges connecting the set $A$ to the set $B$. ■

We can transform this argument into an efficient algorithm for finding a cut with value 
at least m/2. We first show how to obtain a Las Vegas algorithm. In Section 6.3, we 
show how to construct a deterministic polynomial time algorithm. 

It is easy to randomly choose a partition as described in the proof. The expectation 
argument does not give a lower bound on the probability that a random partition has a 
cut of value at least $m/2$. To derive such a bound, let

$$p = \Pr\Bigl(C(A, B) \ge \frac{m}{2}\Bigr),$$

and observe that $C(A, B) \le m$. Then

$$\frac{m}{2} = \mathrm{E}[C(A, B)] = \sum_{i \le m/2 - 1} i \Pr(C(A, B) = i) + \sum_{i \ge m/2} i \Pr(C(A, B) = i) \le (1 - p)\Bigl(\frac{m}{2} - 1\Bigr) + pm,$$

which implies that

$$p \ge \frac{1}{m/2 + 1}.$$

The expected number of samples before finding a cut with value at least $m/2$ is therefore
at most $m/2 + 1$. Testing to see if the value of the cut determined by the sample is at
least m/2 can be done in polynomial time simply by counting the edges crossing the 
cut. We therefore have a Las Vegas algorithm for finding the cut. 
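
A minimal sketch of this Las Vegas algorithm follows. The graph representation (vertices 0 to n-1, edges as a list of pairs without self-loops) and the function names are assumptions made for illustration.

```python
import random


def cut_value(edges, side):
    """Count the edges whose endpoints lie on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def las_vegas_cut(n, edges, seed=0):
    """Sample random partitions until one has cut value at least m/2."""
    rng = random.Random(seed)
    m = len(edges)
    while True:
        # side[v] is 0 (set A) or 1 (set B), chosen independently per vertex.
        side = [rng.randrange(2) for _ in range(n)]
        if 2 * cut_value(edges, side) >= m:
            return side


if __name__ == "__main__":
    # Complete graph on 5 vertices: m = 10, so a cut of value >= 5 is required.
    k5 = [(u, v) for u in range(5) for v in range(u + 1, 5)]
    side = las_vegas_cut(5, k5)
    print(side, cut_value(edges=k5, side=side))
```

Each iteration costs $O(n + m)$ time, and by the bound on $p$ derived above the expected number of iterations is at most $m/2 + 1$.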

6.2.2. Application: Maximum Satisfiability 


We can apply a similar argument to the maximum satisfiability (MAXSAT) problem.
In a logical formula, a literal is either a Boolean variable or the negation of a Boolean
variable. We use $\bar{x}$ to denote the negation of the variable $x$. A satisfiability (SAT) problem,
or a SAT formula, is a logical expression that is the conjunction (AND) of a set
of clauses, where each clause is the disjunction (OR) of literals. For example, the following
expression is an instance of SAT:

$$(x_1 \vee \bar{x}_2 \vee \bar{x}_3) \wedge (\bar{x}_1 \vee \bar{x}_3) \wedge (x_1 \vee x_2 \vee x_4) \wedge (x_4 \vee \bar{x}_3) \wedge (x_4 \vee \bar{x}_1).$$


A solution to an instance of a SAT formula is an assignment of the variables to the values
True and False so that all the clauses are satisfied. That is, there is at least one true
literal in each clause. For example, assigning $x_1$ to True, $x_2$ to False, $x_3$ to False, and $x_4$
to True satisfies the preceding SAT formula. In general, determining if a SAT formula
has a solution is NP-hard.

A related goal, given a SAT formula, is satisfying as many of the clauses as possible.
In what follows, let us assume that no clause contains both a variable and its
complement, since in this case the clause is always satisfied. 

Theorem 6.4: Given a set of $m$ clauses, let $k_i$ be the number of literals in the $i$th clause
for $i = 1, \ldots, m$. Let $k = \min_{i=1}^{m} k_i$. Then there is a truth assignment that satisfies at
least

$$\sum_{i=1}^{m} (1 - 2^{-k_i}) \ge m(1 - 2^{-k})$$

clauses.


Proof: Assign values independently and uniformly at random to the variables. The
probability that the $i$th clause with $k_i$ literals is satisfied is at least $(1 - 2^{-k_i})$. The expected
number of satisfied clauses is therefore at least

$$\sum_{i=1}^{m} (1 - 2^{-k_i}) \ge m(1 - 2^{-k}),$$

and there must be an assignment that satisfies at least that many clauses. ■

The foregoing argument can also be easily transformed into an efficient randomized
algorithm; the case where all $k_i = k$ is left as Exercise 6.1.
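
One way such a randomized algorithm might look is sketched below: sample uniform assignments until one satisfies at least $\sum_i (1 - 2^{-k_i})$ clauses. The clause encoding (a positive integer $j$ for $x_j$, a negative integer $-j$ for its negation) and the function names are assumptions for illustration, and the five-clause instance is only a small example input. The sketch does not analyze the expected number of samples.

```python
import random
from fractions import Fraction


def num_satisfied(clauses, assignment):
    """Count clauses containing at least one true literal."""
    return sum(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def expectation_bound(clauses):
    """Exact value of the sum over clauses of (1 - 2^(-k_i))."""
    return sum(1 - Fraction(1, 2 ** len(clause)) for clause in clauses)


def find_good_assignment(clauses, num_vars, seed=0):
    """Resample until the assignment meets the expectation bound."""
    rng = random.Random(seed)
    target = expectation_bound(clauses)
    while True:
        assignment = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
        if num_satisfied(clauses, assignment) >= target:
            return assignment


if __name__ == "__main__":
    clauses = [[1, -2, -3], [-1, -3], [1, 2, 4], [4, -3], [4, -1]]
    a = find_good_assignment(clauses, num_vars=4)
    print(a, num_satisfied(clauses, a), "of", len(clauses))
```

By Theorem 6.4 an assignment meeting the target always exists, so the loop terminates with probability 1.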

6.3. Derandomization Using Conditional Expectations 

The probabilistic method can yield insight into how to construct deterministic algorithms.
As an example, we apply the method of conditional expectations in order to
derandomize the algorithm of Section 6.2.1 for finding a large cut.

Recall that we find a partition of the $n$ vertices $V$ of a graph into sets $A$ and $B$ by
placing each vertex independently and uniformly at random in one of the two sets. This
gives a cut with expected value $\mathrm{E}[C(A, B)] \ge m/2$. Now imagine placing the vertices
deterministically, one at a time, in an arbitrary order $v_1, v_2, \ldots, v_n$. Let $x_i$ be the set
where $v_i$ is placed (so $x_i$ is either $A$ or $B$). Suppose that we have placed the first $k$
vertices, and consider the expected value of the cut if the remaining vertices are then
placed independently and uniformly into one of the two sets. We write this quantity
as $\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k]$; it is the conditional expectation of the value of the cut
given the locations $x_1, x_2, \ldots, x_k$ of the first $k$ vertices. We show inductively how to
place the next vertex so that

$$\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k] \le \mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_{k+1}].$$

It follows that

$$\mathrm{E}[C(A, B)] \le \mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_n].$$

The right-hand side is the value of the cut determined by our placement algorithm,
since if $x_1, x_2, \ldots, x_n$ are all determined then we have a cut of the graph. Hence our
algorithm returns a cut whose value is at least $\mathrm{E}[C(A, B)] \ge m/2$.

The base case in the induction is

$$\mathrm{E}[C(A, B) \mid x_1] = \mathrm{E}[C(A, B)],$$

which holds by symmetry because it does not matter where we place the first vertex.
We now prove the inductive step, that

$$\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k] \le \mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_{k+1}]. \qquad (6.1)$$

Consider placing $v_{k+1}$ randomly, so that it is placed in $A$ or $B$ with probability 1/2
each, and let $Y_{k+1}$ be a random variable representing the set where it is placed. Then

$$\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k] = \frac{1}{2}\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k, Y_{k+1} = A] + \frac{1}{2}\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k, Y_{k+1} = B].$$

It follows that

$$\max\bigl(\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k, Y_{k+1} = A],\ \mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k, Y_{k+1} = B]\bigr) \ge \mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k].$$

Therefore, all we have to do is compute the two quantities $\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k, Y_{k+1} = A]$
and $\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k, Y_{k+1} = B]$ and then place $v_{k+1}$ in
the set that yields the larger expectation. Once we do this, we will have a placement
satisfying

$$\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k] \le \mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_{k+1}].$$

To compute $\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k, Y_{k+1} = A]$, note that the conditioning gives
the placement of the first $k + 1$ vertices. We can therefore compute the number of edges
among these vertices that contribute to the value of the cut. For each other edge, the probability
that it will later contribute to the cut is 1/2, since this is the probability its two
endpoints end up on different sides of the cut. By linearity of expectations,
$\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k, Y_{k+1} = A]$ is the number of edges crossing the cut whose endpoints are
both among the first $k + 1$ vertices, plus half of the remaining edges. This is easy to
compute in linear time. The same is true for $\mathrm{E}[C(A, B) \mid x_1, x_2, \ldots, x_k, Y_{k+1} = B]$.

In fact, from this argument, we see that the larger of the two quantities is determined
just by whether $v_{k+1}$ has more neighbors in $A$ or in $B$. All edges that do not have
$v_{k+1}$ as an endpoint contribute the same amount to the two expectations. Our derandomized
algorithm therefore has the following simple form: Take the vertices in some
order. Place the first vertex arbitrarily in $A$. Place each successive vertex to maximize
the number of edges crossing the cut. Equivalently, place each vertex on the side with
fewer neighbors, breaking ties arbitrarily. This is a simple greedy algorithm, and our
analysis shows that it always guarantees a cut with at least $m/2$ edges.
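
The greedy algorithm just described can be sketched in a few lines. The representation (vertices 0 to n-1 processed in index order, edges as pairs without self-loops) and the names `greedy_cut` and `cut_value` are illustrative assumptions; ties are broken in favor of $A$, which also places the first vertex in $A$.

```python
def cut_value(edges, side):
    """Count the edges whose endpoints lie on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def greedy_cut(n, edges):
    """Place each vertex on the side holding fewer of its placed neighbors."""
    adjacency = [[] for _ in range(n)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    side = [None] * n  # None means "not placed yet"
    for v in range(n):
        in_a = sum(1 for u in adjacency[v] if side[u] == "A")
        in_b = sum(1 for u in adjacency[v] if side[u] == "B")
        # Joining the side with fewer neighbors cuts the larger group of edges.
        side[v] = "A" if in_a <= in_b else "B"
    return side


if __name__ == "__main__":
    k5 = [(u, v) for u in range(5) for v in range(u + 1, 5)]
    side = greedy_cut(5, k5)
    print(side, cut_value(k5, side))
```

Every edge is examined a constant number of times, so the running time is $O(n + m)$, and no randomness is used.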


6.4. Sample and Modify 

Thus far we have used the probabilistic method to construct random structures with 
the desired properties directly. In some cases it is easier to work indirectly, breaking 
the argument into two stages. In the first stage we construct a random structure that 
does not have the required properties. In the second stage we then modify the random
structure so that it does have the required property. We give two examples of this
sample-and-modify technique. 


6.4.1. Application: Independent Sets 

An independent set in a graph G is a set of vertices with no edges between them. 
Finding the largest independent set in a graph is an NP-hard problem. The following 
theorem shows that the probabilistic method can yield bounds on the size of the largest 
independent set of a graph. 


Theorem 6.5: Let $G = (V, E)$ be a graph on $n$ vertices with $m$ edges. Then $G$ has an
independent set with at least $n^2/4m$ vertices.


Proof: Let $d = 2m/n$ be the average degree of the vertices in $G$. Consider the following
randomized algorithm.

1. Delete each vertex of $G$ (together with its incident edges) independently with probability
$1 - 1/d$.

2. For each remaining edge, remove it and one of its adjacent vertices.


The remaining vertices form an independent set, since all edges have been removed. 
This is an example of the sample-and-modify technique. We first sample the vertices, 
and then we modify the remaining graph. 

Let $X$ be the number of vertices that survive the first step of the algorithm. Since the
graph has $n$ vertices and since each vertex survives with probability $1/d$, it follows that

$$\mathrm{E}[X] = \frac{n}{d}.$$

Let $Y$ be the number of edges that survive the first step. There are $nd/2$ edges in the
graph, and an edge survives if and only if its two adjacent vertices survive. Thus

$$\mathrm{E}[Y] = \frac{nd}{2}\Bigl(\frac{1}{d}\Bigr)^{2} = \frac{n}{2d}.$$

The second step of the algorithm removes all the remaining edges and at most $Y$
vertices. When the algorithm terminates, it outputs an independent set of size at least
$X - Y$, and

$$\mathrm{E}[X - Y] = \frac{n}{d} - \frac{n}{2d} = \frac{n}{2d}.$$

The expected size of the independent set generated by the algorithm is at least $n/2d$, so the
graph has an independent set with at least $n/2d = n^2/4m$ vertices. ■
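
The two-step algorithm from the proof can be sketched as follows. The function name and graph representation are illustrative assumptions. The survival probability is clamped to 1 when $d < 1$, a case in which $1/d$ would not be a valid probability; that clamp is an addition of this sketch, not part of the proof.

```python
import random


def sample_and_modify_independent_set(n, edges, seed=None):
    """Return an independent set of the graph on vertices 0..n-1."""
    rng = random.Random(seed)
    m = len(edges)
    if m == 0:
        return set(range(n))  # no edges: every vertex can be kept

    d = 2 * m / n                  # average degree
    keep_prob = min(1.0, 1.0 / d)  # assumption: clamp when d < 1

    # Step 1 (sample): each vertex survives independently with probability 1/d.
    survivors = {v for v in range(n) if rng.random() < keep_prob}

    # Step 2 (modify): for each surviving edge, delete one endpoint.
    for u, v in edges:
        if u in survivors and v in survivors:
            survivors.discard(v)
    return survivors


if __name__ == "__main__":
    cycle = [(i, (i + 1) % 10) for i in range(10)]
    print(sorted(sample_and_modify_independent_set(10, cycle, seed=1)))
```

A single run is only guaranteed to return an independent set; the size bound $n^2/4m$ holds in expectation, so a run may return a smaller set.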

6.4.2. Application: Graphs with Large Girth 


As another example we consider the girth of a graph, which is the length of its smallest 
cycle. Intuitively we expect dense graphs to have small girth. We can show, however, 
that there are dense graphs with relatively large girth. 

Theorem 6.6: For any integer $k \ge 3$ and for sufficiently large $n$, there is a graph with $n$ nodes, at least
$\frac{1}{4} n^{1+1/k}$ edges, and girth at least $k$.

Proof: We first sample a random graph $G \in G_{n,p}$ with $p = n^{1/k - 1}$. Let $X$ be the number
of edges in the graph. Then

$$\mathrm{E}[X] = p\binom{n}{2} = \frac{1}{2}\Bigl(1 - \frac{1}{n}\Bigr) n^{1/k+1}.$$

Let $Y$ be the number of cycles in the graph of length at most $k - 1$. Any specific
possible cycle of length $i$, where $3 \le i \le k - 1$, occurs with probability $p^i$. Also,
there are $\binom{n}{i}\frac{(i-1)!}{2}$ possible cycles of length $i$; to see this, first consider choosing the $i$
vertices, then consider the possible orders, and finally keep in mind that reversing the
order yields the same cycle. Hence,

$$\mathrm{E}[Y] = \sum_{i=3}^{k-1} \binom{n}{i}\frac{(i-1)!}{2}\, p^i \le \sum_{i=3}^{k-1} n^i p^i = \sum_{i=3}^{k-1} n^{i/k} < k n^{(k-1)/k}.$$

We modify the original randomly chosen graph $G$ by eliminating one edge from
each cycle of length up to $k - 1$. The modified graph therefore has girth at least $k$.
When $n$ is sufficiently large, the expected number of edges in the resulting graph is

$$\mathrm{E}[X - Y] \ge \frac{1}{2}\Bigl(1 - \frac{1}{n}\Bigr) n^{1/k+1} - k n^{(k-1)/k} \ge \frac{1}{4} n^{1/k+1}.$$

Hence there exists a graph with at least $\frac{1}{4} n^{1+1/k}$ edges and girth at least $k$. ■


6.5. The Second Moment Method 


The second moment method is another useful way to apply the probabilistic method. 
The standard approach typically makes use of the following inequality, which is easily 
derived from Chebyshev’s inequality. 


Theorem 6.7: If $X$ is a nonnegative integer-valued random variable, then

$$\Pr(X = 0) \le \frac{\mathrm{Var}[X]}{(\mathrm{E}[X])^2}. \qquad (6.2)$$

Proof:

$$\Pr(X = 0) \le \Pr(|X - \mathrm{E}[X]| \ge \mathrm{E}[X]) \le \frac{\mathrm{Var}[X]}{(\mathrm{E}[X])^2}. \qquad ■$$


6.5.1. Application: Threshold Behavior in Random Graphs 

The second moment method can be used to prove the threshold behavior of certain random
graph properties. That is, in the $G_{n,p}$ model it is often the case that there is a
threshold function $f$ such that: (a) when $p$ is just less than $f(n)$, almost no graph has
the desired property; whereas (b) when $p$ is just larger than $f(n)$, almost every graph
has the desired property. We present here a relatively simple example.


Theorem 6.8: In $G_{n,p}$, suppose that $p = f(n)$, where $f(n) = o(n^{-2/3})$. Then, for
any $\varepsilon > 0$ and for sufficiently large $n$, the probability that a random graph chosen from
$G_{n,p}$ has a clique of four or more vertices is less than $\varepsilon$. Similarly, if $f(n) = \omega(n^{-2/3})$
then, for sufficiently large $n$, the probability that a random graph chosen from $G_{n,p}$
does not have a clique with four or more vertices is less than $\varepsilon$.


Proof: We first consider the case in which $p = f(n)$ and $f(n) = o(n^{-2/3})$. Let
$C_1, \ldots, C_{\binom{n}{4}}$ be an enumeration of all the subsets of four vertices in $G$. Let

$$X_i = \begin{cases} 1 & \text{if } C_i \text{ is a 4-clique,} \\ 0 & \text{otherwise.} \end{cases}$$

Let

$$X = \sum_{i=1}^{\binom{n}{4}} X_i,$$

so that

$$\mathrm{E}[X] = \binom{n}{4} p^6.$$

In this case $\mathrm{E}[X] = o(1)$, which means that $\mathrm{E}[X] < \varepsilon$ for sufficiently large $n$. Since $X$ is
a nonnegative integer-valued random variable, it follows that $\Pr(X \ge 1) \le \mathrm{E}[X] < \varepsilon$.
Hence, the probability that a random graph chosen from $G_{n,p}$ has a clique of four or
more vertices is less than $\varepsilon$.

We now consider the case when $p = f(n)$ and $f(n) = \omega(n^{-2/3})$. In this case,
$\mathrm{E}[X] \to \infty$ as $n$ grows large. This in itself is not sufficient to conclude that, with high
probability, a graph chosen at random from $G_{n,p}$ has a clique of at least four vertices.
We can, however, use Theorem 6.7 to prove that $\Pr(X = 0) = o(1)$ in this case. To
do so we must show that $\mathrm{Var}[X] = o((\mathrm{E}[X])^2)$. Here we shall compute the variance
directly; an alternative approach is given as Exercise 6.11.

We begin with the following useful formula. 


Lemma 6.9: Let $Y_i$, $i = 1, \ldots, m$, be 0-1 random variables, and let $Y = \sum_{i=1}^{m} Y_i$.
Then

$$\mathrm{Var}[Y] \le \mathrm{E}[Y] + \sum_{1 \le i, j \le m;\ i \ne j} \mathrm{Cov}(Y_i, Y_j).$$

Proof: For any sequence of random variables $Y_1, \ldots, Y_m$,

$$\mathrm{Var}\Bigl[\sum_{i=1}^{m} Y_i\Bigr] = \sum_{i=1}^{m} \mathrm{Var}[Y_i] + \sum_{1 \le i, j \le m;\ i \ne j} \mathrm{Cov}(Y_i, Y_j).$$

This is the generalization of Theorem 3.2 to $m$ variables.

When $Y_i$ is a 0-1 random variable, $\mathrm{E}[Y_i^2] = \mathrm{E}[Y_i]$ and so

$$\mathrm{Var}[Y_i] = \mathrm{E}[Y_i^2] - (\mathrm{E}[Y_i])^2 \le \mathrm{E}[Y_i],$$

giving the lemma. □


We wish to compute

$$\mathrm{Var}[X] = \mathrm{Var}\Bigl[\sum_{i=1}^{\binom{n}{4}} X_i\Bigr].$$

Applying Lemma 6.9, we see that we need to consider the covariance of the $X_i$. If
$|C_i \cap C_j| = 0$ then the corresponding cliques are disjoint, and it follows that $X_i$ and $X_j$
are independent. Hence, in this case, $\mathrm{E}[X_i X_j] - \mathrm{E}[X_i]\mathrm{E}[X_j] = 0$. The same is true if
$|C_i \cap C_j| = 1$.

If $|C_i \cap C_j| = 2$, then the corresponding cliques share one edge. For both cliques
to be in the graph, the eleven corresponding edges must appear in the graph. Hence,
in this case $\mathrm{E}[X_i X_j] - \mathrm{E}[X_i]\mathrm{E}[X_j] \le \mathrm{E}[X_i X_j] \le p^{11}$. There are $\binom{n}{6}$ ways to choose
the six vertices and $\binom{6}{2;2;2}$ ways to split them into $C_i$ and $C_j$ (because we choose two
vertices for $C_i \cap C_j$, two for $C_i$ alone, and two for $C_j$ alone).

If $|C_i \cap C_j| = 3$, then the corresponding cliques share three edges. For both cliques
to be in the graph, the nine corresponding edges must appear in the graph. Hence, in
this case $\mathrm{E}[X_i X_j] - \mathrm{E}[X_i]\mathrm{E}[X_j] \le \mathrm{E}[X_i X_j] \le p^9$. There are $\binom{n}{5}$ ways to choose the
five vertices, and $\binom{5}{3;1;1}$ ways to split them into $C_i$ and $C_j$.

Finally, recall again that $\mathrm{E}[X] = \binom{n}{4} p^6$ and $p = f(n) = \omega(n^{-2/3})$. Therefore,

$$\mathrm{Var}[X] \le \binom{n}{4} p^6 + \binom{n}{6}\binom{6}{2;2;2} p^{11} + \binom{n}{5}\binom{5}{3;1;1} p^9 = o(n^8 p^{12}) = o((\mathrm{E}[X])^2),$$

since

$$(\mathrm{E}[X])^2 = \Bigl(\binom{n}{4} p^6\Bigr)^{2} = \Theta(n^8 p^{12}).$$

Theorem 6.7 now applies, showing that $\Pr(X = 0) = o(1)$ and thus the second part of
the theorem. ■


6.6. The Conditional Expectation Inequality 

For a sum of Bernoulli random variables, we can derive an alternative to the second
moment method that is often easier to apply.

Theorem 6.10: Let $X = \sum_{i=1}^{n} X_i$, where each $X_i$ is a 0-1 random variable. Then

$$\Pr(X > 0) \ge \sum_{i=1}^{n} \frac{\Pr(X_i = 1)}{\mathrm{E}[X \mid X_i = 1]}. \qquad (6.3)$$

Notice that the $X_i$ need not be independent for Eqn. (6.3) to hold.


Proof: Let $Y = 1/X$ if $X > 0$, with $Y = 0$ otherwise. Then

$$\Pr(X > 0) = \mathrm{E}[XY].$$

However,

$$\begin{aligned}
\mathrm{E}[XY] &= \mathrm{E}\Bigl[\sum_{i=1}^{n} X_i Y\Bigr] \\
&= \sum_{i=1}^{n} \mathrm{E}[X_i Y] \\
&= \sum_{i=1}^{n} \bigl(\mathrm{E}[X_i Y \mid X_i = 1]\Pr(X_i = 1) + \mathrm{E}[X_i Y \mid X_i = 0]\Pr(X_i = 0)\bigr) \\
&= \sum_{i=1}^{n} \mathrm{E}[Y \mid X_i = 1]\Pr(X_i = 1) \\
&= \sum_{i=1}^{n} \mathrm{E}[1/X \mid X_i = 1]\Pr(X_i = 1) \\
&\ge \sum_{i=1}^{n} \frac{\Pr(X_i = 1)}{\mathrm{E}[X \mid X_i = 1]}.
\end{aligned}$$

The key step is from the third to the fourth line, where we use conditional expectations
in a fruitful way by taking advantage of the fact that $\mathrm{E}[X_i Y \mid X_i = 0] = 0$. The last
line makes use of Jensen's inequality, with the convex function $f(x) = 1/x$. ■



We can use Theorem 6.10 to give an alternate proof of Theorem 6.8. Specifically, if
$p = f(n) = \omega(n^{-2/3})$, we use Theorem 6.10 to show that, for any constant $\varepsilon > 0$ and
for sufficiently large $n$, the probability that a random graph chosen from $G_{n,p}$ does not
have a clique with four or more vertices is less than $\varepsilon$.

As in the proof of Theorem 6.8, let $X = \sum_{i=1}^{\binom{n}{4}} X_i$, where $X_i$ is 1 if the subset of four
vertices $C_i$ is a 4-clique and 0 otherwise. For a specific $X_j$, we have $\Pr(X_j = 1) = p^6$.
Using the linearity of expectations, we compute

$$\mathrm{E}[X \mid X_j = 1] = \mathrm{E}\Bigl[\sum_{i=1}^{\binom{n}{4}} X_i \Bigm| X_j = 1\Bigr] = \sum_{i=1}^{\binom{n}{4}} \mathrm{E}[X_i \mid X_j = 1].$$

Conditioning on $X_j = 1$, we now compute $\mathrm{E}[X_i \mid X_j = 1]$ by using that, for a 0-1
random variable,

$$\mathrm{E}[X_i \mid X_j = 1] = \Pr(X_i = 1 \mid X_j = 1).$$

There are $\binom{n-4}{4}$ sets of vertices $C_i$ that do not intersect $C_j$. Each corresponding $X_i$
is 1 with probability $p^6$. Similarly, $X_i = 1$ with probability $p^6$ for the $4\binom{n-4}{3}$ sets $C_i$
that have one vertex in common with $C_j$.

For the remaining cases, we have $\Pr(X_i = 1 \mid X_j = 1) = p^5$ for the $6\binom{n-4}{2}$ sets
$C_i$ that have two vertices in common with $C_j$ and $\Pr(X_i = 1 \mid X_j = 1) = p^3$ for the
$4\binom{n-4}{1}$ sets $C_i$ that have three vertices in common with $C_j$. Summing, we have

$$\mathrm{E}[X \mid X_j = 1] = \sum_{i=1}^{\binom{n}{4}} \mathrm{E}[X_i \mid X_j = 1] = 1 + \binom{n-4}{4} p^6 + 4\binom{n-4}{3} p^6 + 6\binom{n-4}{2} p^5 + 4\binom{n-4}{1} p^3.$$

Applying Theorem 6.10 yields

$$\Pr(X > 0) \ge \frac{\binom{n}{4} p^6}{1 + \binom{n-4}{4} p^6 + 4\binom{n-4}{3} p^6 + 6\binom{n-4}{2} p^5 + 4\binom{n-4}{1} p^3},$$

which approaches 1 as $n$ grows large when $p = f(n) = \omega(n^{-2/3})$.
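
To get a feel for how quickly this lower bound approaches 1, it can be evaluated numerically. The sketch below assumes the choice $p = n^{-1/2}$, which is $\omega(n^{-2/3})$; both that choice and the function name are illustrative only.

```python
from math import comb


def four_clique_lower_bound(n: int, p: float) -> float:
    """Lower bound on Pr(X > 0) for 4-cliques in G(n, p) from Theorem 6.10."""
    expected_cliques = comb(n, 4) * p ** 6
    conditional_expectation = (
        1
        + comb(n - 4, 4) * p ** 6      # sets disjoint from C_j
        + 4 * comb(n - 4, 3) * p ** 6  # sets sharing one vertex with C_j
        + 6 * comb(n - 4, 2) * p ** 5  # sets sharing two vertices
        + 4 * comb(n - 4, 1) * p ** 3  # sets sharing three vertices
    )
    return expected_cliques / conditional_expectation


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        p = n ** -0.5
        print(n, round(four_clique_lower_bound(n, p), 4))
```

The printed values are lower bounds on the probability that a 4-clique exists, not the probability itself.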


6.7. The Lovasz Local Lemma 


One of the most elegant and useful tools in applying the probabilistic method is the Lovasz
local lemma. Let $E_1, \ldots, E_n$ be a set of bad events in some probability space. We
want to show that there is an element in the sample space that is not included in any of
the bad events.

This would be easy to do if the events were mutually independent. Recall that events
$E_1, E_2, \ldots, E_n$ are mutually independent if and only if, for any subset $I \subseteq [1, n]$,

$$\Pr\Bigl(\bigcap_{i \in I} E_i\Bigr) = \prod_{i \in I} \Pr(E_i).$$

Also, if $E_1, \ldots, E_n$ are mutually independent then so are $\bar{E}_1, \ldots, \bar{E}_n$. (This was left as
Exercise 1.20.) If $\Pr(E_i) < 1$ for all $i$, then

$$\Pr\Bigl(\bigcap_{i=1}^{n} \bar{E}_i\Bigr) = \prod_{i=1}^{n} \Pr(\bar{E}_i) > 0,$$

and there is an element of the sample space that is not included in any bad event.

Mutual independence is too much to ask for in many arguments. The Lovasz local
lemma generalizes the preceding argument to the case where the $n$ events are not mutually
independent but the dependency is limited. Specifically, following from the
definition of mutual independence, we say that an event $E$ is mutually independent of
the events $E_1, E_2, \ldots, E_n$ if, for any subset $I \subseteq [1, n]$,

$$\Pr\Bigl(E \Bigm| \bigcap_{j \in I} E_j\Bigr) = \Pr(E).$$


The dependency between events can be represented in terms of a dependency graph.

Definition 6.1: A dependency graph for a set of events $E_1, \ldots, E_n$ is a graph $G = (V, E)$
such that $V = \{1, \ldots, n\}$ and, for $i = 1, \ldots, n$, event $E_i$ is mutually independent
of the events $\{E_j \mid (i, j) \notin E\}$.

We discuss first a special case, the symmetric version of the Lovasz local lemma, which 
is more intuitive and is sufficient for most algorithmic applications. 

Theorem 6.11 [Lovasz Local Lemma]: Let $E_1, \ldots, E_n$ be a set of events, and assume
that the following hold:

1. for all $i$, $\Pr(E_i) \le p$;

2. the degree of the dependency graph given by $E_1, \ldots, E_n$ is bounded by $d$;

3. $4dp \le 1$.

Then

$$\Pr\Bigl(\bigcap_{i=1}^{n} \bar{E}_i\Bigr) > 0.$$

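Proof: Let $S \subset \{1, \ldots, n\}$. We prove by induction on $s = 0, \ldots, n - 1$ that, if $|S| \le s$,
then for all $k \notin S$ we have

$$\Pr\Bigl(E_k \Bigm| \bigcap_{j \in S} \bar{E}_j\Bigr) \le 2p.$$

For this expression to be well-defined when $S$ is not empty, we need $\Pr\bigl(\bigcap_{j \in S} \bar{E}_j\bigr) > 0$.

The base case $s = 0$ follows from the assumption that $\Pr(E_k) \le p$. To perform the
inductive step, we first show that $\Pr\bigl(\bigcap_{j \in S} \bar{E}_j\bigr) > 0$. This is true when $s = 1$, because
$\Pr(\bar{E}_j) \ge 1 - p > 0$. For $s > 1$, without loss of generality let $S = \{1, 2, \ldots, s\}$. Then

$$\begin{aligned}
\Pr\Bigl(\bigcap_{i=1}^{s} \bar{E}_i\Bigr) &= \prod_{i=1}^{s} \Pr\Bigl(\bar{E}_i \Bigm| \bigcap_{j=1}^{i-1} \bar{E}_j\Bigr) \\
&= \prod_{i=1}^{s} \Bigl(1 - \Pr\Bigl(E_i \Bigm| \bigcap_{j=1}^{i-1} \bar{E}_j\Bigr)\Bigr) \\
&\ge \prod_{i=1}^{s} (1 - 2p) > 0.
\end{aligned}$$

In obtaining the last line we used the induction hypothesis.

For the rest of the induction, let $S_1 = \{j \in S \mid (k, j) \in E\}$ and $S_2 = S - S_1$. If $S_2 = S$
then $E_k$ is mutually independent of the events $\bar{E}_i$, $i \in S$, and

$$\Pr\Bigl(E_k \Bigm| \bigcap_{j \in S} \bar{E}_j\Bigr) = \Pr(E_k) \le p.$$

We continue with the case $|S_2| < s$. It will be helpful to introduce the following notation.
Let $F_S$ be defined by

$$F_S = \bigcap_{j \in S} \bar{E}_j,$$

and similarly define $F_{S_1}$ and $F_{S_2}$. Notice that $F_S = F_{S_1} \cap F_{S_2}$.

Applying the definition of conditional probability yields

$$\Pr(E_k \mid F_S) = \frac{\Pr(E_k \cap F_S)}{\Pr(F_S)}. \qquad (6.4)$$

Applying the definition of conditional probability to the numerator of (6.4), we obtain

$$\Pr(E_k \cap F_S) = \Pr(E_k \cap F_{S_1} \cap F_{S_2}) = \Pr(E_k \cap F_{S_1} \mid F_{S_2}) \Pr(F_{S_2}).$$

The denominator can be written as

$$\Pr(F_S) = \Pr(F_{S_1} \cap F_{S_2}) = \Pr(F_{S_1} \mid F_{S_2}) \Pr(F_{S_2}).$$

Canceling the common factor, which we have already shown to be nonzero, yields

$$\Pr(E_k \mid F_S) = \frac{\Pr(E_k \cap F_{S_1} \mid F_{S_2})}{\Pr(F_{S_1} \mid F_{S_2})}. \qquad (6.5)$$

Note that (6.5) is valid even when $S_2 = \emptyset$.

Since the probability of an intersection of events is bounded by the probability of
any one of the events and since $E_k$ is independent of the events in $S_2$, we can bound
the numerator of (6.5) by

$$\Pr(E_k \cap F_{S_1} \mid F_{S_2}) \le \Pr(E_k \mid F_{S_2}) = \Pr(E_k) \le p.$$

Because $|S_2| < |S| = s$, we can apply the induction hypothesis to

$$\Pr(E_i \mid F_{S_2}) = \Pr\Bigl(E_i \Bigm| \bigcap_{j \in S_2} \bar{E}_j\Bigr).$$

Using also the fact that $|S_1| \le d$, we establish a lower bound on the denominator of
(6.5) as follows:

$$\begin{aligned}
\Pr(F_{S_1} \mid F_{S_2}) &= \Pr\Bigl(\bigcap_{i \in S_1} \bar{E}_i \Bigm| \bigcap_{j \in S_2} \bar{E}_j\Bigr) \\
&\ge 1 - \sum_{i \in S_1} \Pr\Bigl(E_i \Bigm| \bigcap_{j \in S_2} \bar{E}_j\Bigr) \\
&\ge 1 - \sum_{i \in S_1} 2p \\
&\ge 1 - 2pd \\
&\ge \frac{1}{2}.
\end{aligned}$$

Using the upper bound for the numerator and the lower bound for the denominator,
we prove the induction:

$$\Pr(E_k \mid F_S) = \frac{\Pr(E_k \cap F_{S_1} \mid F_{S_2})}{\Pr(F_{S_1} \mid F_{S_2})} \le \frac{p}{1/2} = 2p.$$

The theorem follows from

$$\begin{aligned}
\Pr\Bigl(\bigcap_{i=1}^{n} \bar{E}_i\Bigr) &= \prod_{i=1}^{n} \Pr\Bigl(\bar{E}_i \Bigm| \bigcap_{j=1}^{i-1} \bar{E}_j\Bigr) \\
&= \prod_{i=1}^{n} \Bigl(1 - \Pr\Bigl(E_i \Bigm| \bigcap_{j=1}^{i-1} \bar{E}_j\Bigr)\Bigr) \\
&\ge \prod_{i=1}^{n} (1 - 2p) > 0. \qquad ■
\end{aligned}$$
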
6.7.1. Application: Edge-Disjoint Paths 

Assume that $n$ pairs of users need to communicate using edge-disjoint paths on a given
network. Each pair $i = 1, \ldots, n$ can choose a path from a collection $F_i$ of $m$ paths. We
show using the Lovasz local lemma that, if the possible paths do not share too many
edges, then there is a way to choose $n$ edge-disjoint paths connecting the $n$ pairs.

Theorem 6.12: If any path in $F_i$ shares edges with no more than $k$ paths in $F_j$, where
$i \ne j$ and $8nk/m \le 1$, then there is a way to choose $n$ edge-disjoint paths connecting
the $n$ pairs.


Proof: Consider the probability space defined by each pair choosing a path independently
and uniformly at random from its set of $m$ paths. Define $E_{i,j}$ to represent the
event that the paths chosen by pairs $i$ and $j$ share at least one edge. Since a path in $F_i$
shares edges with no more than $k$ paths in $F_j$,

$$p = \Pr(E_{i,j}) \le \frac{k}{m}.$$

Let $d$ be the degree of the dependency graph. Since event $E_{i,j}$ is independent of all
events $E_{i',j'}$ when $i' \notin \{i, j\}$ and $j' \notin \{i, j\}$, we have $d < 2n$. Since

$$4dp < \frac{8nk}{m} \le 1,$$

all of the conditions of the Lovasz local lemma are satisfied, proving

$$\Pr\Bigl(\bigcap_{i \ne j} \bar{E}_{i,j}\Bigr) > 0.$$

Hence, there is a choice of paths such that the $n$ paths are edge disjoint. ■

6.7.2. Application: Satisfiability

As a second example, we return to the satisfiability question. For the $k$-satisfiability
($k$-SAT) problem, the formula is restricted so that each clause has exactly $k$ literals.
Again, we assume that no clause contains both a literal and its negation, as these clauses
are trivial. We prove that any $k$-SAT formula in which no variable appears in too many
clauses has a satisfying assignment.

Theorem 6.13: If no variable in a $k$-SAT formula appears in more than $T = 2^k/(4k)$
clauses, then the formula has a satisfying assignment.

Proof: Consider the probability space defined by giving a random assignment to the
variables. For $i = 1, \ldots, m$, let $E_i$ denote the event that the $i$th clause is not satisfied
by the random assignment. Since each clause has $k$ literals,

$$\Pr(E_i) = 2^{-k}.$$

The event $E_i$ is mutually independent of all of the events related to clauses that do
not share variables with clause $i$. Because each of the $k$ variables in clause $i$ can appear
in no more than $T = 2^k/(4k)$ clauses, the degree of the dependency graph is bounded by
$d \le kT \le 2^{k-2}$.

In this case,

$$4dp \le 4 \cdot 2^{k-2}\, 2^{-k} \le 1,$$

so we can apply the Lovasz local lemma to conclude that

$$\Pr\Bigl(\bigcap_{i=1}^{m} \bar{E}_i\Bigr) > 0;$$

hence there is a satisfying assignment for the formula. ■


6.8.* Explicit Constructions Using the Local Lemma 


The Lovász local lemma proves that a random element in an appropriately defined
sample space has a nonzero probability of satisfying our requirement. However, this
probability might be too small for an algorithm that is based on simple sampling. The
number of objects that we need to sample before we find an element that satisfies our
requirements might be exponential in the problem size.

In a number of interesting applications, the existential result of the Lovász local
lemma can be used to derive efficient construction algorithms. Although the details
differ in the specific applications, all the known algorithms are based on a common
two-phase scheme. In the first phase, a subset of the variables of the problem are assigned
random values; the remaining variables are deferred to the second phase. The
subset of variables that are assigned values in the first phase is chosen so that:

1. using the local lemma, one can show that the random partial solution fixed in the
first phase can be extended to a full solution of the problem without modifying any
of the variables fixed in the first phase; and

2. the dependency graph $H$ between events defined by the variables deferred to the
second phase has, with high probability, only small connected components.

When the dependency graph consists of connected components, a solution for the
variables of one component can be found independently of the other components. Thus,
the first phase of the two-phase algorithm breaks the original problem into smaller
subproblems. Each of the smaller subproblems can then be solved independently in the
second phase by an exhaustive search.


6.8.1. Application: A Satisfiability Algorithm

We demonstrate this technique in an algorithm for finding a satisfying assignment for
a $k$-SAT formula. The explicit construction result will be significantly weaker than the
existence result proven in the previous section. In particular, we obtain a polynomial
time algorithm only for the case when $k$ is a constant. This result is still interesting,
since for $k \ge 3$ the problem of $k$-satisfiability is NP-complete. For notational
convenience we treat here only the case where $k$ is an even constant; the case where $k$ is an
odd constant is similar.

Consider a $k$-SAT formula $\mathcal{F}$, with $k$ an even constant, such that each variable
appears in no more than $T = 2^{\alpha k}$ clauses for some constant $\alpha > 0$ determined in the
proof. Let $x_1, \ldots, x_\ell$ be the $\ell$ variables and $C_1, \ldots, C_m$ the $m$ clauses of $\mathcal{F}$.

Following the outline suggested in Section 6.8, our algorithm for finding a satisfying
assignment for $\mathcal{F}$ has two phases. Some of the variables are fixed at the first phase,
and the remaining variables are deferred to the second phase. While executing the first
phase, we call a clause $C_i$ dangerous if both the following conditions hold:

1. $k/2$ literals of the clause $C_i$ have been fixed; and

2. $C_i$ is not yet satisfied.

Phase I can be described as follows. Consider the variables $x_1, \ldots, x_\ell$ sequentially.
If $x_i$ is not in a dangerous clause, assign it independently and uniformly at random a
value in {0,1}.

A clause is a surviving clause if it is not satisfied by the variables fixed in phase I.
Note that a surviving clause has no more than $k/2$ of its variables fixed in the first
phase. A deferred variable is a variable that was not assigned a value in the first phase.
In phase II, we use exhaustive search in order to assign values to the deferred variables
and so complete a satisfying assignment for the formula.
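
Phase I as described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: clauses are tuples of nonzero integers (`v` for $x_v$, `-v` for its negation), no clause repeats a variable, and the name `phase_one` and the `rng` parameter are hypothetical.

```python
import random


def phase_one(clauses, num_vars, k, rng=random):
    """Phase I sketch: returns (assignment, deferred).

    assignment maps each variable fixed in phase I to a bool;
    deferred lists the variables left for phase II.
    """
    assignment = {}
    deferred = []
    # Index: variable -> positions of the clauses it appears in.
    occurs = {v: [] for v in range(1, num_vars + 1)}
    for idx, clause in enumerate(clauses):
        for lit in clause:
            occurs[abs(lit)].append(idx)

    def satisfied(clause):
        return any(abs(l) in assignment and assignment[abs(l)] == (l > 0)
                   for l in clause)

    def dangerous(clause):
        # k/2 literals fixed and the clause not yet satisfied.
        fixed = sum(1 for l in clause if abs(l) in assignment)
        return fixed == k // 2 and not satisfied(clause)

    for v in range(1, num_vars + 1):
        if any(dangerous(clauses[i]) for i in occurs[v]):
            deferred.append(v)  # v lies in a dangerous clause: defer it
        else:
            assignment[v] = rng.random() < 0.5  # uniform value in {0, 1}
    return assignment, deferred
```

Once a clause becomes dangerous, none of its remaining variables is ever fixed, which is why a surviving clause never has more than $k/2$ fixed variables.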

In the next two lemmas we show that

1. the partial solution computed in phase I can be extended to a full satisfying
assignment of $\mathcal{F}$, and

2. with high probability, the exhaustive search in phase II is completed in time that is
polynomial in $m$.

Lemma 6.14: There is an assignment of values to the deferred variables such that all
the surviving clauses are satisfied.

Proof: Let $H = (V, E)$ be a graph on $m$ nodes, where $V = \{1, \ldots, m\}$, and let $(i, j) \in E$
if and only if $C_i \cap C_j \ne \emptyset$. That is, $H$ is the dependency graph for the original
problem. Let $H' = (V', E')$ be a graph with $V' \subseteq V$ and $E' \subseteq E$ such that (a) $i \in V'$ if
and only if $C_i$ is a surviving clause and (b) $(i, j) \in E'$ if and only if $C_i$ and $C_j$ share
a deferred variable. In the following discussion we do not distinguish between node $i$
and clause $i$.

Consider the probability space defined by assigning a random value in {0,1}
independently to each deferred variable. The assignment of values to the nondeferred
variables in phase I, together with the random assignment of values to the deferred
variables, defines an assignment to all the $\ell$ variables. For $i = 1, \ldots, m$, let $E_i$ be the event
that surviving clause $C_i$ is not satisfied by this assignment. Associate the event $E_i$ with
node $i$ in $V'$. The graph $H'$ is then the dependency graph for this set of events.

A surviving clause has at least $k/2$ deferred variables, so

$$p = \Pr(E_i) \le 2^{-k/2}.$$

A variable appears in no more than $T$ clauses; therefore, the degree of the dependency
graph is bounded by

$$d \le kT \le k2^{\alpha k}.$$

For a sufficiently small constant $\alpha > 0$,

$$4dp \le 4k2^{\alpha k}2^{-k/2} \le 1$$

and so, by the Lovász local lemma, there is an assignment for the deferred variables
that, together with the assignment of values to variables in phase I, satisfies the
formula. ■
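
To make "sufficiently small" concrete for this lemma alone, one can solve the final inequality for $\alpha$. The rearrangement and the numeric instance below are a worked illustration added here, not taken from the original; the proof of Lemma 6.15 places a further, stricter requirement on $\alpha$.

```latex
\[
4k\,2^{\alpha k}\,2^{-k/2} \le 1
\iff 2 + \log_2 k + \alpha k - \frac{k}{2} \le 0
\iff \alpha \le \frac{1}{2} - \frac{2 + \log_2 k}{k}.
\]
% Example: k = 16 gives alpha <= 1/2 - 6/16 = 1/8, so T = 2^{alpha k} = 4, and indeed
\[
4 \cdot 16 \cdot 2^{2} \cdot 2^{-8} = \frac{256}{256} = 1 .
\]
```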

The assignment of values to a subset of the variables in phase I partitions the problem
into as many as $m$ independent subformulas, so that each deferred variable appears in
only one subformula. The subformulas are given by the connected components of $H'$.
If we can show that each connected component in $H'$ has size $O(\log m)$, then each
subformula will have no more than $O(k \log m)$ deferred variables. An exhaustive search
of all the possible assignments for all variables in each subformula can then be done in
polynomial time. Hence we focus on the following lemma.
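
The decomposition into independent subformulas followed by exhaustive search can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the signed-integer clause encoding (`v` for $x_v$, `-v` for its negation), takes the phase I result as a dict, and uses a hypothetical function name `phase_two`. Deferred variables that occur in no surviving clause are set to False, an arbitrary choice.

```python
from itertools import product


def phase_two(clauses, assignment, num_vars):
    """Extend a partial assignment {var: bool} to all variables, or return None."""
    def satisfied(clause, values):
        return any(abs(l) in values and values[abs(l)] == (l > 0) for l in clause)

    def free_vars(clause):
        return [abs(l) for l in clause if abs(l) not in assignment]

    surviving = [c for c in clauses if not satisfied(c, assignment)]

    # Union-find over deferred variables: variables sharing a surviving
    # clause belong to the same subformula (a connected component of H').
    parent = {}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for c in surviving:
        free = free_vars(c)
        if not free:
            return None  # clause already falsified, nothing left to set
        for v in free:
            parent.setdefault(v, v)
        for v in free[1:]:
            parent[find(v)] = find(free[0])

    components = {}
    for c in surviving:
        components.setdefault(find(free_vars(c)[0]), []).append(c)

    full = dict(assignment)
    for comp in components.values():
        variables = sorted({v for c in comp for v in free_vars(c)})
        for values in product([False, True], repeat=len(variables)):
            trial = dict(assignment)
            trial.update(zip(variables, values))
            if all(satisfied(c, trial) for c in comp):
                full.update(zip(variables, values))
                break
        else:
            return None  # this subformula has no satisfying extension
    for v in range(1, num_vars + 1):
        full.setdefault(v, False)
    return full
```

The search cost is exponential only in the number of deferred variables per component, which is the quantity Lemma 6.15 bounds.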


Lemma 6.15: All connected components in $H'$ are of size $O(\log m)$ with probability
$1 - o(1)$.


Proof: Consider a connected component $R$ of $r$ vertices in $H$. If $R$ is a connected
component in $H'$, then all its $r$ nodes are surviving clauses. A surviving clause is either
a dangerous clause or it shares at least one deferred variable with a dangerous clause
(i.e., it has a neighbor in $H'$ that is a dangerous clause). The probability that a given
clause is dangerous is at most $2^{-k/2}$, since exactly $k/2$ of its variables were given random
values in phase I yet none of these values satisfied the clause. The probability
that a given clause survives is the probability that either this clause or at least one of its
direct neighbors is dangerous, which is bounded by

$$(d + 1)2^{-k/2},$$

where again $d = kT > 1$.

If the survival of individual clauses were independent events then we would be in
excellent shape. However, from our description here it is evident that such events are
not independent. Instead, we identify a subset of the vertices in $R$ such that the survival
of the clauses represented by the vertices of this subset are independent events.
A 4-tree $S$ of a connected component $R$ in $H$ is defined as follows:


1. $S$ is a rooted tree;

2. any two nodes in $S$ are at distance at least 4 in $H$;

3. there can be an edge in $S$ only between two nodes with distance exactly 4 between
them in $H$;

4. any node of $R$ is either in $S$ or is at distance 3 or less from a node in $S$.

Considering the nodes in a 4-tree proves useful because the event that a node $u$ in
a 4-tree survives and the event that another node $v$ in a 4-tree survives are actually
independent. Any clause that could cause $u$ to survive has distance at least 2 from any
clause that could cause $v$ to survive. Clauses at distance 2 share no variables, and hence
the events that they are dangerous are independent. We can take advantage of this
independence to conclude that, for any 4-tree $S$, the probability that the nodes in the 4-tree
survive is at most

$$\bigl((d + 1)2^{-k/2}\bigr)^{|S|}.$$

A maximal 4-tree $S$ of a connected component $R$ is the 4-tree with the largest possible
number of vertices. Since the degree of the dependency graph is bounded by $d$,
there are no more than

$$d + d(d - 1) + d(d - 1)(d - 1) \le d^3$$

nodes at distance 3 or less from any given vertex. We therefore claim that a maximal
4-tree of $R$ must have at least $r/d^3$ vertices. Otherwise, when we consider the vertices
of the maximal 4-tree $S$ and all neighbors within distance 3 or less of these vertices,
we obtain fewer than $r$ vertices. Hence there must be a vertex of distance at least 4
from all vertices in $S$. If this vertex has distance exactly 4 from some vertex in $S$, then
it can be added to $S$ and thus $S$ is not maximal, yielding a contradiction. If its distance
is larger than 4 from all vertices in $S$, consider any path that brings it closer to $S$; such
a path must eventually pass through a vertex of distance at least 4 from all vertices in
$S$ and of distance 4 from some vertex in $S$, again contradicting the maximality of $S$.
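
The covering argument above suggests a simple greedy construction. The sketch below is an illustration under assumptions, not part of the original proof: the graph is an adjacency dict for one connected component, the function names are hypothetical, and the result is a 4-tree to which no further vertex can be added, which is all the covering argument uses, though it need not have the largest possible number of vertices.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def greedy_four_tree(adj, root):
    """Grow a 4-tree from root; returns {node: parent}, with root -> None."""
    dist = {root: bfs_distances(adj, root)}
    parent = {root: None}
    grew = True
    while grew:
        grew = False
        for u in list(parent):
            for v, d_uv in dist[u].items():
                # v must be at distance exactly 4 from its tree parent u and
                # at distance at least 4 from every node already in the tree.
                if d_uv == 4 and all(dist[s][v] >= 4 for s in parent):
                    parent[v] = u
                    dist[v] = bfs_distances(adj, v)
                    grew = True
                    break
            if grew:
                break
    return parent
```

On a path with vertices 0 through 8 and root 0, the construction selects vertices 0, 4, and 8, and every other vertex lies within distance 3 of one of them.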

To show that with probability $1 - o(1)$ there is no connected component $R$ of size
$r \ge c \log_2 m$ for some constant $c$ in $H'$, we show that there is no 4-tree of $H$ of size
$r/d^3$ that survives with probability $1 - o(1)$. Since a surviving connected component
$R$ would have a maximal 4-tree of size $r/d^3$, the absence of such a 4-tree implies the
absence of such a component.

We need to count the number of 4-trees of size $s = r/d^3$ in $H$. We can choose the
root of the 4-tree in $m$ ways. A tree with root $v$ is uniquely defined by an Eulerian tour
that starts and ends at $v$ and traverses each edge of the tree twice, once in each direction.
Since an edge of $S$ represents a path of length 4 in $H$, at each vertex in the 4-tree
the Eulerian path can continue in as many as $d^4$ different ways, and therefore the number
of 4-trees of size $s = r/d^3$ in $H$ is bounded by

$$m(d^4)^{2s} = md^{8r/d^3}.$$

The probability that the nodes of each such 4-tree survive in $H'$ is at most

$$\bigl((d + 1)2^{-k/2}\bigr)^{s} = \bigl((d + 1)2^{-k/2}\bigr)^{r/d^3}.$$

Hence the probability that $H'$ has a connected component of size $r$ is bounded by

$$md^{8r/d^3}\bigl((d + 1)2^{-k/2}\bigr)^{r/d^3} \le m2^{(rk/d^3)(8\alpha + 2\alpha - 1/2)} = o(1)$$

for $r \ge c \log_2 m$ and for a suitably large constant $c$ and a sufficiently small constant
$\alpha > 0$. ■

Thus, we have the following theorem.


Theorem 6.16: Consider a $k$-SAT formula with $m$ clauses, where $k$ is an even constant
and each variable appears in up to $2^{\alpha k}$ clauses for a sufficiently small constant
$\alpha > 0$. Then there is an algorithm that finds a satisfying assignment for the formula in
expected time that is polynomial in $m$.

Proof: As we have described, if the first phase partitions the problem into subformulas
involving only $O(k \log m)$ variables, then a solution can be found by solving each
subformula exhaustively in time that is polynomial in $m$. The probability of the first
phase partitioning the problem appropriately is $1 - o(1)$, so we need only run phase I a
constant number of times on average before obtaining a good partition. The theorem
follows. ■


6.9. Lovász Local Lemma: The General Case


For completeness we include the statement and proof of the general case of the Lovász
local lemma.

Theorem 6.17: Let $E_1, \ldots, E_n$ be a set of events in an arbitrary probability space,
and let $G = (V, E)$ be the dependency graph for these events. Assume there exist
$x_1, \ldots, x_n \in [0, 1]$ such that, for all $1 \le i \le n$,

$$\Pr(E_i) \le x_i \prod_{(i,j) \in E} (1 - x_j).$$

Then

$$\Pr\Bigl(\bigcap_{i=1}^{n} \overline{E_i}\Bigr) \ge \prod_{i=1}^{n} (1 - x_i).$$


Proof: Let $S \subseteq \{1, \ldots, n\}$. We prove by induction on $s = 0, \ldots, n - 1$ that, if $|S| \le s$, then
for all $k$ we have

$$\Pr\Bigl(E_k \Bigm| \bigcap_{j \in S} \overline{E_j}\Bigr) \le x_k.$$


As in the case of the symmetric version of the local lemma, we must be careful that the
conditional probability is well-defined. This follows using the same approach as in the
symmetric case, so we focus on the rest of the induction.

The base case $s = 0$ follows from the assumption that

$$\Pr(E_k) \le x_k \prod_{(k,j) \in E} (1 - x_j).$$

For the inductive step, let $S_1 = \{j \in S \mid (k, j) \in E\}$ and $S_2 = S - S_1$. If $S_2 = S$
then $E_k$ is mutually independent of the events $\overline{E_i}$, $i \in S$, and

$$\Pr\Bigl(E_k \Bigm| \bigcap_{j \in S} \overline{E_j}\Bigr) = \Pr(E_k) \le x_k.$$

We continue with the case $|S_2| < s$. We again use the notation

$$F_S = \bigcap_{j \in S} \overline{E_j}$$

and define $F_{S_1}$ and $F_{S_2}$ similarly, so that $F_S = F_{S_1} \cap F_{S_2}$.
Applying the definition of conditional probability yields

$$\Pr(E_k \mid F_S) = \frac{\Pr(E_k \cap F_S)}{\Pr(F_S)}. \tag{6.6}$$

By once again applying the definition of conditional probability, the numerator of (6.6)
can be written as

$$\Pr(E_k \cap F_S) = \Pr(E_k \cap F_{S_1} \mid F_{S_2}) \Pr(F_{S_2})$$

and the denominator as

$$\Pr(F_S) = \Pr(F_{S_1} \mid F_{S_2}) \Pr(F_{S_2}).$$

Canceling the common factor then yields

$$\Pr(E_k \mid F_S) = \frac{\Pr(E_k \cap F_{S_1} \mid F_{S_2})}{\Pr(F_{S_1} \mid F_{S_2})}. \tag{6.7}$$


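Since the probability of an intersection of events is bounded by the probability of
each of the events and since $E_k$ is independent of the events in $S_2$, we can bound the
numerator of (6.7) by

$$\Pr(E_k \cap F_{S_1} \mid F_{S_2}) \le \Pr(E_k \mid F_{S_2}) = \Pr(E_k) \le x_k \prod_{(k,j) \in E} (1 - x_j).$$

To bound the denominator of (6.7), let $S_1 = \{j_1, \ldots, j_r\}$. Applying the induction
hypothesis, we have

$$\begin{aligned}
\Pr(F_{S_1} \mid F_{S_2}) &= \Pr\Bigl(\bigcap_{j \in S_1} \overline{E_j} \Bigm| \bigcap_{j \in S_2} \overline{E_j}\Bigr) \\
&= \prod_{i=1}^{r} \Bigl(1 - \Pr\Bigl(E_{j_i} \Bigm| \Bigl(\bigcap_{t=1}^{i-1} \overline{E_{j_t}}\Bigr) \cap \Bigl(\bigcap_{j \in S_2} \overline{E_j}\Bigr)\Bigr)\Bigr) \\
&\ge \prod_{i=1}^{r} (1 - x_{j_i}) \\
&\ge \prod_{(k,j) \in E} (1 - x_j).
\end{aligned}$$

Using the upper bound for the numerator and the lower bound for the denominator,
we can prove the induction hypothesis:

$$\begin{aligned}
\Pr\Bigl(E_k \Bigm| \bigcap_{j \in S} \overline{E_j}\Bigr) &= \Pr(E_k \mid F_S) \\
&= \frac{\Pr(E_k \cap F_{S_1} \mid F_{S_2})}{\Pr(F_{S_1} \mid F_{S_2})} \\
&\le \frac{x_k \prod_{(k,j) \in E} (1 - x_j)}{\prod_{(k,j) \in E} (1 - x_j)} \\
&= x_k.
\end{aligned}$$

The theorem now follows from:

$$\begin{aligned}
\Pr(\overline{E_1}, \ldots, \overline{E_n}) &= \prod_{i=1}^{n} \Pr(\overline{E_i} \mid \overline{E_1}, \ldots, \overline{E_{i-1}}) \\
&= \prod_{i=1}^{n} \bigl(1 - \Pr(E_i \mid \overline{E_1}, \ldots, \overline{E_{i-1}})\bigr) \\
&\ge \prod_{i=1}^{n} (1 - x_i) > 0. \qquad ■
\end{aligned}$$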
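
The hypothesis of Theorem 6.17 is a finite set of inequalities, so for a concrete family of events it can be checked directly. The sketch below is illustrative only: the function name, the list-based encoding of the dependency graph, and the example numbers are assumptions made here, not part of the original text.

```python
from math import prod


def general_lll_bound(probs, neighbors, x):
    """Check Pr(E_i) <= x_i * prod_{(i,j) in E} (1 - x_j) for every i.

    probs[i] is Pr(E_i), neighbors[i] lists the j with (i, j) in E, and
    x[i] is the chosen value in [0, 1]. Returns the lower bound
    prod_i (1 - x_i) on the probability that no event occurs, or None
    if some inequality fails for this choice of x.
    """
    for i, p in enumerate(probs):
        if p > x[i] * prod(1 - x[j] for j in neighbors[i]):
            return None
    return prod(1 - xi for xi in x)
```

For example, with three events whose dependency graph is the path 0, 1, 2, each of probability 0.1, the choice $x_i = 0.2$ satisfies every inequality and gives the bound $0.8^3 = 0.512$.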


6.10. Exercises 

Exercise 6.1: Consider an instance of SAT with m clauses, where every clause has 
exactly k literals. 


(a) Give a Las Vegas algorithm that finds an assignment satisfying at least $m(1 - 2^{-k})$
clauses, and analyze its expected running time.

(b) Give a derandomization of the randomized algorithm using the method of conditional
expectations.

Exercise 6.2: (a) Prove that, for every integer $n$, there exists a coloring of the edges of
the complete graph $K_n$ by two colors so that the total number of monochromatic copies
of $K_4$ is at most $\binom{n}{4}2^{-5}$.

(b) Give a randomized algorithm for finding a coloring with at most $\binom{n}{4}2^{-5}$
monochromatic copies of $K_4$ that runs in expected time polynomial in $n$.

(c) Show how to construct such a coloring deterministically in polynomial time using
the method of conditional expectations.

Exercise 6.3: Given an $n$-vertex undirected graph $G = (V, E)$, consider the following
method of generating an independent set. Given a permutation $\sigma$ of the vertices,
define a subset $S(\sigma)$ of the vertices as follows: for each vertex $i$, $i \in S(\sigma)$ if and only
if no neighbor $j$ of $i$ precedes $i$ in the permutation $\sigma$.

(a) Show that each $S(\sigma)$ is an independent set in $G$.

(b) Suggest a natural randomized algorithm to produce $\sigma$ for which you can show that
the expected cardinality of $S(\sigma)$ is

$$\sum_{i=1}^{n} \frac{1}{d_i + 1},$$

where $d_i$ denotes the degree of vertex $i$.

(c) Prove that $G$ has an independent set of size at least $\sum_{i=1}^{n} 1/(d_i + 1)$.

Exercise 6.4: Consider the following two-player game. The game begins with $k$ tokens
placed at the number 0 on the integer number line spanning $[0, n]$. Each round,
one player, called the chooser, selects two disjoint and nonempty sets of tokens $A$ and
$B$. (The sets $A$ and $B$ need not cover all the remaining tokens; they only need to be
disjoint.) The second player, called the remover, takes all the tokens from one of the
sets off the board. The tokens from the other set all move up one space on the number
line from their current position. The chooser wins if any token ever reaches $n$. The
remover wins if the chooser finishes with one token that has not reached $n$.

(a) Give a winning strategy for the chooser when $k \ge 2^n$.

(b) Use the probabilistic method to show that there must exist a winning strategy for
the remover when $k < 2^n$.

(c) Explain how to use the method of conditional expectations to derandomize the
winning strategy for the remover when $k < 2^n$.


Exercise 6.5: We have shown using the probabilistic method that, if a graph $G$ has $n$
nodes and $m$ edges, then there exists a partition of the $n$ nodes into sets $A$ and $B$ such
that at least $m/2$ edges cross the partition. Improve this result slightly: show that there
exists a partition such that at least $mn/(2n - 1)$ edges cross the partition.


Exercise 6.6: We can generalize the problem of finding a large cut to finding a large
$k$-cut. A $k$-cut is a partition of the vertices into $k$ disjoint sets, and the value of a cut is
the weight of all edges crossing from one of the $k$ sets to another. In Section 6.2.1 we
considered 2-cuts when all edges had the same weight 1, showing via the probabilistic
method that any graph $G$ with $m$ edges has a cut with value at least $m/2$. Generalize
this argument to show that any graph $G$ with $m$ edges has a $k$-cut with value at least
$(k - 1)m/k$. Show how to use derandomization (following the argument of Section 6.3)
to give a deterministic algorithm for finding such a cut.


Exercise 6.7: A hypergraph $H$ is a pair of sets $(V, E)$, where $V$ is the set of vertices
and $E$ is the set of hyperedges. Every hyperedge in $E$ is a subset of $V$. In particular, an
$r$-uniform hypergraph is one where the size of each edge is $r$. For example, a 2-uniform
hypergraph is just a standard graph. A dominating set in a hypergraph $H$ is a set of
vertices $S \subseteq V$ such that $e \cap S \ne \emptyset$ for every edge $e \in E$. That is, $S$ hits every edge of
the hypergraph.

Let $H = (V, E)$ be an $r$-uniform hypergraph with $n$ vertices and $m$ edges. Show
that there is a dominating set of size at most $np + (1 - p)^r m$ for every real number
$0 \le p \le 1$. Also, show that there is a dominating set of size at most $(m + n \ln r)/r$.


Exercise 6.8: Prove that, for every integer $n$, there exists a way to 2-color the edges
of $K_x$ so that there is no monochromatic clique of size $k$ when



(Hint: Start by 2-coloring the edges of K n , then fix things up.) 

Exercise 6.9: A tournament is a graph on $n$ vertices with exactly one directed edge
between each pair of vertices. If vertices represent players, then each edge can be thought
of as the result of a match between the two players: the edge points to the winner. A
ranking is an ordering of the $n$ players from best to worst (ties are not allowed). Given
the outcome of a tournament, one might wish to determine a ranking of the players. A
ranking is said to disagree with a directed edge from $y$ to $x$ if $y$ is ahead of $x$ in the
ranking (since $x$ beat $y$ in the tournament).


(a) Prove that, for every tournament, there exists a ranking that disagrees with at most 
50% of the edges. 

(b) Prove that, for sufficiently large n, there exists a tournament such that every rank¬ 
ing disagrees with at least 49% of the edges in the tournament. 


Exercise 6.10: A family of subsets $\mathcal{F}$ of $\{1, 2, \ldots, n\}$ is called an antichain if there is
no pair of sets $A$ and $B$ in $\mathcal{F}$ satisfying $A \subset B$.


(a) Give an example of $\mathcal{F}$ where $|\mathcal{F}| = \binom{n}{\lfloor n/2 \rfloor}$.

(b) Let $f_k$ be the number of sets in $\mathcal{F}$ with size $k$. Show that

$$\sum_{k=0}^{n} \frac{f_k}{\binom{n}{k}} \le 1.$$

(Hint: Choose a random permutation of the numbers from 1 to $n$, and let $X_k = 1$
if the first $k$ numbers in your permutation yield a set in $\mathcal{F}$. If $X = \sum_{k=0}^{n} X_k$, what
can you say about $X$?)

(c) Argue that $|\mathcal{F}| \le \binom{n}{\lfloor n/2 \rfloor}$ for any antichain $\mathcal{F}$.

Exercise 6.11: In Section 6.5.1, we bounded the variance of the number of 4-cliques in
a random graph in order to demonstrate the second moment method. Show how to
calculate the variance directly by using the equality from Exercise 3.9: for $X = \sum_{i=1}^{n} X_i$
the sum of Bernoulli random variables,

$$E[X^2] = \sum_{i=1}^{n} \Pr(X_i = 1) E[X \mid X_i = 1].$$

Exercise 6.12: Consider the problem of whether graphs in $G_{n,p}$ have cliques of constant
size $k$. Suggest an appropriate threshold function for this property. Generalize
the argument used for cliques of size 4, using either the second moment method or the
conditional expectation inequality, to prove that your threshold function is correct for
cliques of size 5.

Exercise 6.13: Consider a graph in $G_{n,p}$, with $p = c \ln n/n$. Use the second moment
method or the conditional expectation inequality to prove that if $c < 1$ then, for any
constant $\epsilon > 0$ and for $n$ sufficiently large, the graph has isolated vertices with probability
at least $1 - \epsilon$.

Exercise 6.14: Consider a graph in $G_{n,p}$, with $p = 1/n$. Let $X$ be the number of triangles
in the graph, where a triangle is a clique with three edges. Show that

$$\Pr(X \ge 1) \le 1/6$$

and that

$$\lim_{n \to \infty} \Pr(X \ge 1) \ge 1/7.$$

(Hint: Use the conditional expectation inequality.)


Exercise 6.15: Consider the set-balancing problem of Section 4.4. We claim that there
is an $n \times n$ matrix $A$ for which $\|A\bar{b}\|_\infty$ is $\Omega(\sqrt{n})$ for any choice of $\bar{b}$. For convenience
here we assume that $n$ is even.


(a) We have shown in Eqn. (5.5) that

$$n! \le e\sqrt{n}\Bigl(\frac{n}{e}\Bigr)^n.$$

Using similar ideas, show that

$$n! \ge a\sqrt{n}\Bigl(\frac{n}{e}\Bigr)^n$$

for some positive constant $a$.

(b) Let $b_1, b_2, \ldots, b_{m/2}$ all equal 1, and let $b_{m/2+1}, b_{m/2+2}, \ldots, b_m$ all equal $-1$. Let
$Y_1, Y_2, \ldots, Y_m$ each be chosen independently and uniformly at random from {0,1}.
Show that there exists a positive constant $c$ such that, for sufficiently large $m$,

$$\Pr\Bigl(\Bigl|\sum_{i=1}^{m} b_i Y_i\Bigr| > c\sqrt{m}\Bigr) > \frac{1}{2}.$$

(Hint: Condition on the number of $Y_i$ that are equal to 1.)

(c) Let $b_1, b_2, \ldots, b_m$ each be equal to either 1 or $-1$. Let $Y_1, Y_2, \ldots, Y_m$ each be chosen
independently and uniformly at random from {0,1}. Show that there exists a
positive constant $c$ such that, for sufficiently large $m$,

$$\Pr\Bigl(\Bigl|\sum_{i=1}^{m} b_i Y_i\Bigr| > c\sqrt{m}\Bigr) > \frac{1}{2}.$$

(d) Prove that there exists a matrix $A$ for which $\|A\bar{b}\|_\infty$ is $\Omega(\sqrt{n})$ for any choice of $\bar{b}$.


Exercise 6.16: Use the Lovasz local lemma to show that, if 



then it is possible to color the edges of $K_n$ with two colors so that it has no
monochromatic $K_k$ subgraph.


Exercise 6.17: Use the general form of the Lovász local lemma to prove that the symmetric
version of Theorem 6.11 can be improved by replacing the condition $4dp \le 1$
by the weaker condition $ep(d + 1) \le 1$.


Exercise 6.18: Let $G = (V, E)$ be an undirected graph and suppose each $v \in V$ is
associated with a set $S(v)$ of $8r$ colors, where $r \ge 1$. Suppose, in addition, that for each
$v \in V$ and $c \in S(v)$ there are at most $r$ neighbors $u$ of $v$ such that $c$ lies in $S(u)$. Prove
that there is a proper coloring of $G$ assigning to each vertex $v$ a color from its class
$S(v)$ such that, for any edge $(u, v) \in E$, the colors assigned to $u$ and $v$ are different.
You may want to let $A_{u,v,c}$ be the event that $u$ and $v$ are both colored with color $c$ and
then consider the family of such events.

Markov Chains and Random Walks


Markov chains provide a simple but powerful framework for modeling random pro¬ 
cesses. We start this chapter with the basic definitions related to Markov chains and 
then show how Markov chains can be used to analyze simple randomized algorithms 
for the 2-SAT and 3-SAT problems. Next we study the long-term behavior of Markov 
chains, explaining the classifications of states and conditions for convergence to a sta¬ 
tionary distribution. We apply these techniques to analyzing simple gambling schemes 
and a discrete version of a Markovian queue. Of special interest is the limiting be¬ 
havior of random walks on graphs. We prove bounds on the covering time of a graph 
and use this bound to develop a simple randomized algorithm for the s-t connectiv¬ 
ity problem. Finally, we apply Markov chain techniques to resolve a subtle probability 
problem known as Parrondo’s paradox. 


7.1. Markov Chains: Definitions and Representations 

A stochastic process $\mathbf{X} = \{X(t) : t \in T\}$ is a collection of random variables. The index
$t$ often represents time, and in that case the process $\mathbf{X}$ models the value of a random
variable $X$ that changes over time.

We call $X(t)$ the state of the process at time $t$. In what follows, we use $X_t$
interchangeably with $X(t)$. If, for all $t$, $X_t$ assumes values from a countably infinite set, then
we say that $\mathbf{X}$ is a discrete space process. If $X_t$ assumes values from a finite set then the
process is finite. If $T$ is a countably infinite set we say that $\mathbf{X}$ is a discrete time process.

In this chapter we focus on a special type of discrete time, discrete space stochastic
process $X_0, X_1, X_2, \ldots$ in which the value of $X_t$ depends on the value of $X_{t-1}$ but not
on the sequence of states that led the system to that value.

Definition 7.1: A discrete time stochastic process $X_0, X_1, X_2, \ldots$ is a Markov chain if [1]

$$\Pr(X_t = a_t \mid X_{t-1} = a_{t-1}, X_{t-2} = a_{t-2}, \ldots, X_0 = a_0) = \Pr(X_t = a_t \mid X_{t-1} = a_{t-1}) = P_{a_{t-1}, a_t}.$$

[1] Strictly speaking, this is a time-homogeneous Markov chain; this will be the only type we study in this book.

This definition expresses that the state $X_t$ depends on the previous state $X_{t-1}$ but is
independent of the particular history of how the process arrived at state $X_{t-1}$. This is
called the Markov property or memoryless property, and it is what we mean when we
say that a chain is Markovian. It is important to note that the Markov property does not
imply that $X_t$ is independent of the random variables $X_0, X_1, \ldots, X_{t-2}$; it just implies
that any dependency of $X_t$ on the past is captured in the value of $X_{t-1}$.

Without loss of generality, we can assume that the discrete state space of the Markov
chain is $\{0, 1, 2, \ldots, n\}$ (or $\{0, 1, 2, \ldots\}$ if it is countably infinite). The transition
probability

$$P_{i,j} = \Pr(X_t = j \mid X_{t-1} = i)$$

is the probability that the process moves from $i$ to $j$ in one step. The Markov property
implies that the Markov chain is uniquely defined by the one-step transition matrix:

$$P = \begin{pmatrix}
P_{0,0} & P_{0,1} & \cdots & P_{0,j} & \cdots \\
P_{1,0} & P_{1,1} & \cdots & P_{1,j} & \cdots \\
\vdots & \vdots & \ddots & \vdots & \ddots \\
P_{i,0} & P_{i,1} & \cdots & P_{i,j} & \cdots \\
\vdots & \vdots & \ddots & \vdots & \ddots
\end{pmatrix}.$$

That is, the entry in the $i$th row and $j$th column is the transition probability $P_{i,j}$. It
follows that, for all $i$, $\sum_{j \ge 0} P_{i,j} = 1$.

This transition matrix representation of a Markov chain is convenient for computing
the distribution of future states of the process. Let $p_i(t)$ denote the probability that the
process is at state $i$ at time $t$. Let $\bar{p}(t) = (p_0(t), p_1(t), p_2(t), \ldots)$ be the vector giving
the distribution of the state of the chain at time $t$. Summing over all possible states at
time $t - 1$, we have

$$p_i(t) = \sum_{j \ge 0} p_j(t - 1)P_{j,i}$$

or [2]

$$\bar{p}(t) = \bar{p}(t - 1)P.$$

[2] Operations on vectors are generalized to a countable number of elements in the natural way.

We represent the probability distribution as a row vector and multiply $\bar{p}P$ instead
of $P\bar{p}$ to conform with the interpretation that starting with a distribution $\bar{p}(t - 1)$ and
applying the operand $P$, we arrive at the distribution $\bar{p}(t)$.

For any $m \ge 0$, we define the $m$-step transition probability

$$P^{m}_{i,j} = \Pr(X_{t+m} = j \mid X_t = i)$$

as the probability that the chain moves from state $i$ to state $j$ in exactly $m$ steps.
Conditioning on the first transition from $i$, we have

$$P^{m}_{i,j} = \sum_{k \ge 0} P_{i,k}P^{m-1}_{k,j}. \tag{7.1}$$

Figure 7.1: A Markov chain (left) and the corresponding transition matrix (right).


Let $P^{(m)}$ be the matrix whose entries are the $m$-step transition probabilities, so that the
entry in the $i$th row and $j$th column is $P^{m}_{i,j}$. Then, applying Eqn. (7.1) yields

$$P^{(m)} = P \cdot P^{(m-1)};$$

by induction on $m$,

$$P^{(m)} = P^m.$$

Thus, for any $t \ge 0$ and $m \ge 1$,

$$\bar{p}(t + m) = \bar{p}(t)P^m.$$


Another useful representation of a Markov chain is by a directed, weighted graph
$D = (V, E, w)$. The set of vertices of the graph is the set of states of the chain. There
is a directed edge $(i, j) \in E$ if and only if $P_{i,j} > 0$, in which case the weight $w(i, j)$
of the edge $(i, j)$ is given by $w(i, j) = P_{i,j}$. Self-loops, where an edge starts and ends
at the same vertex, are allowed. Again, for each $i$ we require that $\sum_{j : (i,j) \in E} w(i, j) = 1$.
A sequence of states visited by the process is represented by a directed path on the
graph. The probability that the process follows this path is the product of the weights
of the path's edges.

Figure 7.1 gives an example of a Markov chain and the correspondence between
the two representations. Let us consider how we might calculate with each representation
the probability of going from state 0 to state 3 in exactly three steps. With the
graph, we consider all the paths that go from state 0 to state 3 in exactly three steps.
There are only four such paths: 0-1-0-3, 0-1-3-3, 0-3-1-3, and 0-3-3-3. The probabilities
that the process follows each of these paths are 3/32, 1/96, 1/16, and 3/64,
respectively. Summing these probabilities, we find that the total probability is 41/192.
Alternatively, we can simply compute

$$P^3 = \begin{pmatrix}
3/16 & 7/48 & 29/64 & 41/192 \\
5/48 & 5/24 & 79/144 & 5/36 \\
0 & 0 & 1 & 0 \\
1/16 & 13/96 & 107/192 & 47/192
\end{pmatrix}.$$

The entry $P^{3}_{0,3} = 41/192$ gives the correct answer. The matrix is also helpful if we want
to know the probability of ending in state 3 after three steps when we begin in a state
chosen uniformly at random from the four states. This can be computed by calculating

$$(1/4, 1/4, 1/4, 1/4)P^3 = (17/192, 47/384, 737/1152, 43/288);$$

here the last entry, 43/288, is the required answer.
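
Both calculations can be reproduced with exact rational arithmetic. The sketch below is illustrative: Figure 7.1 is not reproduced in this text, so the one-step matrix `P` is an assumption, chosen to be consistent with the four path probabilities and the matrix $P^3$ quoted above; the helper names are hypothetical.

```python
from fractions import Fraction as F

# Assumed one-step transition matrix (see the caveat above).
P = [
    [F(0), F(1, 4), F(0), F(3, 4)],
    [F(1, 2), F(0), F(1, 3), F(1, 6)],
    [F(0), F(0), F(1), F(0)],
    [F(0), F(1, 2), F(1, 4), F(1, 4)],
]


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(A, m):
    """A raised to the m-th power by repeated multiplication."""
    n = len(A)
    result = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(m):
        result = mat_mul(result, A)
    return result


def path_probability(A, start, end, steps):
    """Graph view: sum over directed paths of the product of edge weights."""
    if steps == 0:
        return F(int(start == end))
    return sum(A[start][k] * path_probability(A, k, end, steps - 1)
               for k in range(len(A)) if A[start][k] > 0)


P3 = mat_pow(P, 3)
uniform = [F(1, 4)] * 4
after_three = [sum(uniform[i] * P3[i][j] for i in range(4)) for j in range(4)]
print(P3[0][3], path_probability(P, 0, 3, 3), after_three[3])
```

The matrix view and the path-enumeration view agree on the entry for states 0 and 3, as the text's two calculations do.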


7.1.1. Application: A Randomized Algorithm for 2-Satisfiability 


Recall from Section 6.2.2 that an input to the general satisfiability (SAT) problem is a 
Boolean formula given as the conjunction (AND) of a set of clauses, where each clause 
is the disjunction (OR) of literals and where a literal is a Boolean variable or the nega¬ 
tion of a Boolean variable. A solution to an instance of a SAT formula is an assignment 
of the variables to the values True (T) and False (F) such that all the clauses are sat¬ 
isfied. The general SAT problem is NP-hard. We analyze here a simple randomized 
algorithm for 2-SAT, a restricted case of the problem that is solvable in polynomial 
time. 

For the $k$-satisfiability ($k$-SAT) problem, the satisfiability formula is restricted so
that each clause has exactly $k$ literals. Hence an input for 2-SAT has exactly two literals
per clause. The following expression is an instance of 2-SAT:

$$(x_1 \vee \overline{x_2}) \wedge (\overline{x_1} \vee \overline{x_3}) \wedge (x_1 \vee x_2) \wedge (x_4 \vee \overline{x_3}) \wedge (x_4 \vee \overline{x_1}). \tag{7.2}$$

One natural approach to finding a solution for a 2-SAT formula is to start with an
assignment, look for a clause that is not satisfied, and change the assignment so that
the clause becomes satisfied. If there are two literals in the clause, then there are two
possible changes to the assignment that will satisfy the clause. Our 2-SAT algorithm
(Algorithm 7.1) decides which of these changes to try randomly. In the algorithm, $n$
denotes the number of variables in the formula and $m$ is an integer parameter that
determines the probability that the algorithm terminates with a correct answer.

In the instance given in (7.2), if we begin with all variables set to False then the
clause $(x_1 \vee x_2)$ is not satisfied. The algorithm might therefore choose this clause and
then select $x_1$ to be set to True. In this case the clause $(x_4 \vee \overline{x_1})$ would be unsatisfied
and the algorithm might switch the value of a variable in that clause, and so on.

If the algorithm terminates with a truth assignment, it clearly returns a correct an¬ 
swer. The case where the algorithm does not find a truth assignment requires some 
care, and we will return to this point later. Assume for now that the formula is satis- 
fiable and that the algorithm will actually run as long as necessary to find a satisfying 
truth assignment. 

We are mainly interested in the number of iterations of the while-loop executed by
the algorithm. We refer to each time the algorithm changes a truth assignment as a step.
Since a 2-SAT formula has $O(n^2)$ distinct clauses, each step can be executed in $O(n^2)$
time. Faster implementations are possible but we do not consider them here. Let $S$
represent a satisfying assignment for the $n$ variables and let $A_i$ represent the variable
assignment after the $i$th step of the algorithm. Let $X_i$ denote the number of variables in
the current assignment $A_i$ that have the same value as in the satisfying assignment $S$.
When $X_i = n$, the algorithm terminates with a satisfying assignment. In fact, the
algorithm could terminate before $X_i$ reaches $n$ if it finds another satisfying assignment, but
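for our analysis the worst case is that the algorithm only stops when $X_i = n$. Starting
with $X_i < n$, we consider how $X_i$ evolves over time, and in particular how long it takes
before $X_i$ reaches $n$.


2-SAT Algorithm:

1. Start with an arbitrary truth assignment.

2. Repeat up to $2mn^2$ times, terminating if all clauses are satisfied:

(a) Choose an arbitrary clause that is not satisfied.

(b) Choose uniformly at random one of the literals in the clause and switch
the value of its variable.

3. If a valid truth assignment has been found, return it.

4. Otherwise, return that the formula is unsatisfiable.


Algorithm 7.1: 2-SAT algorithm.


First, if $X_i = 0$ then, for any change in variable value on the next step, we have
$X_{i+1} = 1$. Hence

\[ \Pr(X_{i+1} = 1 \mid X_i = 0) = 1. \]

Suppose now that $1 \le X_i \le n-1$. At each step, we choose a clause that is unsatisfied.
Since $S$ satisfies the clause, that means that $A_i$ and $S$ disagree on the value of at least
one of the variables in this clause. Because the clause has no more than two variables, the
probability that we increase the number of matches is at least 1/2; it could be 1 in the
case where $A_i$ and $S$ disagree on the value of both variables in this clause. It follows that
the probability that we decrease the number of matches is at most 1/2. Hence, for
$1 \le j \le n-1$,

\[ \Pr(X_{i+1} = j+1 \mid X_i = j) \ge 1/2; \]
\[ \Pr(X_{i+1} = j-1 \mid X_i = j) \le 1/2. \]

The stochastic process $X_0, X_1, X_2, \ldots$ is not necessarily a Markov chain, since the
probability that $X_i$ increases could depend on whether $A_i$ and $S$ disagree on one or
two variables in the unsatisfied clause the algorithm chooses at that step. This, in turn,
might depend on the clauses that have been considered in the past. However, consider the
following Markov chain $Y_0, Y_1, Y_2, \ldots$:

\[ Y_0 = X_0; \]
\[ \Pr(Y_{i+1} = 1 \mid Y_i = 0) = 1; \]
\[ \Pr(Y_{i+1} = j+1 \mid Y_i = j) = 1/2; \]
\[ \Pr(Y_{i+1} = j-1 \mid Y_i = j) = 1/2. \]

The Markov chain $Y_0, Y_1, \ldots$ is a pessimistic version of the stochastic process
$X_0, X_1, \ldots$ in that, whereas $X_i$ increases at the next step with probability at least
1/2, $Y_i$ increases with probability exactly 1/2. It is therefore clear that the expected
time to reach $n$ starting from any point is larger for the Markov chain $Y$ than for the
process $X$, and we use this fact hereafter. (A stronger formal framework for such ideas
is developed in Chapter 11.)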

This Markov chain models a random walk on an undirected graph $G$. (We elaborate
further on random walks in Section 7.4.) The vertices of $G$ are the integers $0, \ldots, n$
and, for $1 \le i \le n-1$, node $i$ is connected to node $i-1$ and node $i+1$. Let $h_j$ be the
expected number of steps to reach $n$ when starting from $j$. For the 2-SAT algorithm,
$h_j$ is an upper bound on the expected number of steps to fully match $S$ when starting
from a truth assignment that matches $S$ in $j$ locations.

Clearly, $h_n = 0$ and $h_0 = h_1 + 1$, since from state 0 we always move to state 1 in one step.
We use linearity of expectations to find an expression for other values of $h_j$. Let $Z_j$ be
a random variable representing the number of steps to reach $n$ from state $j$. Now consider
starting from state $j$, where $1 \le j \le n-1$. With probability 1/2, the next state
is $j-1$, and in this case $Z_j = 1 + Z_{j-1}$. With probability 1/2, the next state is $j+1$,
and in this case $Z_j = 1 + Z_{j+1}$. Hence

\[ E[Z_j] = E\left[ \frac{1}{2}(1 + Z_{j-1}) + \frac{1}{2}(1 + Z_{j+1}) \right]. \]

But $E[Z_j] = h_j$ and so, by applying the linearity of expectations, we obtain

\[ h_j = \frac{h_{j-1} + 1}{2} + \frac{h_{j+1} + 1}{2} = \frac{h_{j-1}}{2} + \frac{h_{j+1}}{2} + 1. \]

We therefore have the following system of equations:

\begin{align*}
h_n &= 0; \\
h_j &= \frac{h_{j-1}}{2} + \frac{h_{j+1}}{2} + 1, \quad 1 \le j \le n-1; \\
h_0 &= h_1 + 1.
\end{align*}

We can show inductively that, for $0 \le j \le n-1$,

\[ h_j = h_{j+1} + 2j + 1. \]

It is true when $j = 0$, since $h_1 = h_0 - 1$. For other values of $j$, we use the equation

\[ h_j = \frac{h_{j-1}}{2} + \frac{h_{j+1}}{2} + 1 \]

to obtain

\begin{align*}
h_{j+1} &= 2h_j - h_{j-1} - 2 \\
&= 2h_j - (h_j + 2(j-1) + 1) - 2 \\
&= h_j - 2j - 1,
\end{align*}

using the induction hypothesis in the second line. We can conclude that

\[ h_0 = h_1 + 1 = h_2 + 1 + 3 = \cdots = \sum_{i=0}^{n-1} (2i + 1) = n^2. \]

An alternative approach for solving the system of equations for the $h_j$ is to guess
and verify the solution $h_j = n^2 - j^2$. The system has $n+1$ linearly independent
equations and $n+1$ unknowns, and hence there is a unique solution for each value of $n$.
Therefore, if this solution satisfies the foregoing equations then it must be correct. We
have $h_n = 0$. For $1 \le j \le n-1$, we check

\begin{align*}
h_j &= \frac{n^2 - (j-1)^2}{2} + \frac{n^2 - (j+1)^2}{2} + 1 \\
&= n^2 - j^2
\end{align*}

and

\begin{align*}
h_0 &= (n^2 - 1) + 1 \\
&= n^2.
\end{align*}


Thus we have proven the following fact. 


Lemma 7.1: Assume that a 2-SAT formula with $n$ variables has a satisfying assignment
and that the 2-SAT algorithm is allowed to run until it finds a satisfying assignment.
Then the expected number of steps until the algorithm finds an assignment is at most $n^2$.
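
As a sanity check on the bound $n^2$, the hitting-time equations for the walk on $0, \ldots, n$ can be solved exactly for small $n$. The sketch below is one way to do it, under the assumption that we rewrite the system in terms of the differences $d_j = h_j - h_{j+1}$; the function name `hitting_times` and the parameter `p_up` are ours, not the text's.

```python
from fractions import Fraction


def hitting_times(n, p_up):
    """Expected steps to reach n from each state j of the walk on 0..n.

    From 0 the walk moves to 1 with probability 1; from 1 <= j <= n-1 it moves
    up with probability p_up and down otherwise. With d_j = h_j - h_{j+1}, the
    system gives d_0 = 1 and d_j = (1 + (1 - p_up) * d_{j-1}) / p_up.
    """
    p_up = Fraction(p_up)
    d = [Fraction(1)]
    for _ in range(1, n):
        d.append((1 + (1 - p_up) * d[-1]) / p_up)
    h = [Fraction(0)] * (n + 1)
    for j in range(n - 1, -1, -1):  # back-substitute from h_n = 0
        h[j] = h[j + 1] + d[j]
    return h
```

With `p_up = Fraction(1, 2)` the differences are $d_j = 2j + 1$, so the values returned agree with $h_j = n^2 - j^2$.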


We now return to the issue of dealing with unsatisfiable formulas by forcing the
algorithm to stop after a fixed number of steps.


Theorem 7.2: The 2-SAT algorithm always returns a correct answer if the formula is
unsatisfiable. If the formula is satisfiable, then with probability at least $1 - 2^{-m}$ the
algorithm returns a satisfying assignment. Otherwise, it incorrectly returns that the
formula is unsatisfiable.

Proof: It is clear that if there is no satisfying assignment then the algorithm correctly
returns that the formula is unsatisfiable. Suppose the formula is satisfiable. Divide the
execution of the algorithm into segments of $2n^2$ steps each. Given that no satisfying
assignment was found in the first $i-1$ segments, what is the conditional probability that
the algorithm did not find a satisfying assignment in the $i$th segment? By Lemma 7.1,
the expected time to find a satisfying assignment, regardless of its starting position,
is bounded by $n^2$. Let $Z$ be the number of steps from the start of segment $i$ until the
algorithm finds a satisfying assignment. Applying Markov's inequality,

\[ \Pr(Z > 2n^2) \le \frac{n^2}{2n^2} = \frac{1}{2}. \]

Thus the probability that the algorithm fails to find a satisfying assignment after $m$
segments is bounded above by $(1/2)^m$. ■


7.1.2. Application: A Randomized Algorithm for 3-Satisfiability 

We now generalize the technique used to develop an algorithm for 2-SAT to obtain a
randomized algorithm for 3-SAT. This problem is NP-complete, so it would be rather
surprising if a randomized algorithm could solve the problem in expected time polynomial
in $n$.$^3$ We present a randomized 3-SAT algorithm that solves 3-SAT in expected
time that is exponential in $n$, but it is much more efficient than the naive approach of
trying all possible truth assignments for the variables.


3-SAT Algorithm:

1. Start with an arbitrary truth assignment.

2. Repeat up to $m$ times, terminating if all clauses are satisfied:

(a) Choose an arbitrary clause that is not satisfied.

(b) Choose one of the literals uniformly at random, and change the value of
the variable in the current truth assignment.

3. If a valid truth assignment has been found, return it.

4. Otherwise, return that the formula is unsatisfiable.


Algorithm 7.2: 3-SAT algorithm.

Let us first consider the performance of a variant of the randomized 2-SAT algorithm
when applied to a 3-SAT problem. The basic approach is the same as in the previous
section; see Algorithm 7.2. In the algorithm, $m$ is a parameter that controls the
probability of success of the algorithm. We focus on bounding the expected time to reach
a satisfying assignment (assuming one exists), as the argument of Theorem 7.2 can be
extended once such a bound is found.

As in the analysis of the 2-SAT algorithm, assume that the formula is satisfiable and
let $S$ be a satisfying assignment. Let the assignment after $i$ steps of the process be $A_i$,
and let $X_i$ be the number of variables in the current assignment $A_i$ that match $S$. It
follows from the same reasoning as for the 2-SAT algorithm that, for $1 \le j \le n-1$,

\[ \Pr(X_{i+1} = j+1 \mid X_i = j) \ge 1/3; \]
\[ \Pr(X_{i+1} = j-1 \mid X_i = j) \le 2/3. \]

These equations follow because at each step we choose an unsatisfied clause, so $A_i$ and
$S$ must disagree on at least one variable in this clause. With probability at least 1/3, we
increase the number of matches between the current truth assignment and $S$. Again we
can obtain an upper bound on the expected number of steps until $X_i = n$ by analyzing
a Markov chain $Y_0, Y_1, \ldots$ such that $Y_0 = X_0$ and

\[ \Pr(Y_{i+1} = 1 \mid Y_i = 0) = 1, \]
\[ \Pr(Y_{i+1} = j+1 \mid Y_i = j) = 1/3, \]
\[ \Pr(Y_{i+1} = j-1 \mid Y_i = j) = 2/3. \]

$^3$ Technically, this would not settle the P = NP question, since we would be using a randomized algorithm and
not a deterministic algorithm to solve an NP-hard problem. It would, however, have similar far-reaching
implications about the ability to solve all NP-complete problems.

In this case, the chain is more likely to go down than up. If we let $h_j$ be the
expected number of steps to reach $n$ when starting from $j$, then the following equations
hold for $h_j$:

\begin{align*}
h_n &= 0; \\
h_j &= \frac{2h_{j-1}}{3} + \frac{h_{j+1}}{3} + 1, \quad 1 \le j \le n-1; \\
h_0 &= h_1 + 1.
\end{align*}

Again, these equations have a unique solution, which is given by

\[ h_j = 2^{n+2} - 2^{j+2} - 3(n - j). \]

Alternatively, the solution can be found by using induction to prove the relationship

\[ h_j = h_{j+1} + 2^{j+2} - 3. \]

We leave it as an exercise to verify that this solution indeed satisfies the foregoing
equations.
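
For readers who want a quick numerical confirmation before attempting the exercise, the closed form can be checked against the three equations with exact rational arithmetic. This is only a spot check for small $n$, not a proof; the helper names `h_closed` and `satisfies_system` are illustrative.

```python
from fractions import Fraction


def h_closed(n, j):
    """Closed form for the 3-SAT chain: h_j = 2^(n+2) - 2^(j+2) - 3(n - j)."""
    return 2 ** (n + 2) - 2 ** (j + 2) - 3 * (n - j)


def satisfies_system(n):
    """Check h_n = 0, h_0 = h_1 + 1, and the interior recurrence exactly."""
    h = [Fraction(h_closed(n, j)) for j in range(n + 1)]
    ok = h[n] == 0 and h[0] == h[1] + 1
    for j in range(1, n):
        ok = ok and h[j] == Fraction(2, 3) * h[j - 1] + Fraction(1, 3) * h[j + 1] + 1
    return ok
```

Calling `satisfies_system(n)` for a range of small `n` exercises every equation in the system, including both boundary conditions.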

The algorithm just described takes $O(2^n)$ steps on average to find a satisfying
assignment. This result is not very compelling, since there are only $2^n$ truth assignments
to try! With some insight, however, we can significantly improve the process. There
are two key observations.

1. If we choose an initial truth assignment uniformly at random, then the number of
variables that match $S$ has a binomial distribution with expectation $n/2$. With an
exponentially small but nonnegligible probability, the process starts with an initial
assignment that matches $S$ in significantly more than $n/2$ variables.

2. Once the algorithm starts, it is more likely to move toward 0 than toward n. The 
longer we run the process, the more likely it has moved toward 0. Therefore, we 
are better off restarting the process with many randomly chosen initial assignments 
and running the process each time for a small number of steps, rather than running 
the process for many steps on the same initial assignment. 

Based on these ideas, we consider the modified procedure of Algorithm 7.3. The
modified algorithm has up to $3n$ steps to reach a satisfying assignment starting from a
random assignment. If it fails to find a satisfying assignment in $3n$ steps, it restarts the
search with a new randomly chosen assignment. We now determine how many times
the process needs to restart before it reaches a satisfying assignment.
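
The restart strategy just described can be sketched in a few lines of Python. As with any illustration of this kind, the details are assumptions: literals are encoded as signed integers (`+v` for $x_v$, `-v` for its negation), the "arbitrary" clause is simply the first unsatisfied one, and the function names are ours.

```python
import random


def satisfied(clause, assignment):
    """A clause holds if at least one of its literals is true."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def modified_3sat(clauses, n, m, rng=random):
    """Up to m random restarts, each followed by at most 3n random flips."""
    for _ in range(m):
        # Fresh truth assignment chosen uniformly at random.
        assignment = {v: rng.random() < 0.5 for v in range(1, n + 1)}
        for step in range(3 * n + 1):
            unsat = [c for c in clauses if not satisfied(c, assignment)]
            if not unsat:
                return assignment
            if step == 3 * n:
                break  # flip budget for this restart is exhausted
            var = abs(rng.choice(unsat[0]))        # random literal of the clause
            assignment[var] = not assignment[var]  # flip its variable
    return None  # reported as unsatisfiable (possibly incorrectly)
```

The only structural difference from the earlier single-run variant is the outer loop: the walk is abandoned after $3n$ flips and started again from a new random point.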

Modified 3-SAT Algorithm:

1. Repeat up to $m$ times, terminating if all clauses are satisfied:

(a) Start with a truth assignment chosen uniformly at random.

(b) Repeat the following up to $3n$ times, terminating if a satisfying
assignment is found:

i. Choose an arbitrary clause that is not satisfied.

ii. Choose one of the literals uniformly at random, and change the value
of the variable in the current truth assignment.

2. If a valid truth assignment has been found, return it.

3. Otherwise, return that the formula is unsatisfiable.


Algorithm 7.3: Modified 3-SAT algorithm.


Let $q$ represent the probability that the modified process reaches $S$ (or some other
satisfying assignment) in $3n$ steps starting with a truth assignment chosen uniformly
at random. Let $q_j$ be a lower bound on the probability that our modified algorithm
reaches $S$ (or some other satisfying assignment) when it starts with a truth assignment
that includes exactly $j$ variables that do not agree with $S$. Consider a particle moving
on the integer line, with probability 1/3 of moving up by one and probability 2/3 of
moving down by one. Notice that

\[ \binom{j+2k}{k} \left(\frac{2}{3}\right)^k \left(\frac{1}{3}\right)^{j+k} \]

is the probability of exactly $k$ moves down and $k+j$ moves up in a sequence of $j+2k$
moves. It is therefore a lower bound on the probability that the algorithm reaches a
satisfying assignment within $j + 2k \le 3n$ steps, starting with an assignment that has
exactly $j$ variables that did not agree with $S$. That is,

\[ q_j \ge \max_{k=0,\ldots,j} \binom{j+2k}{k} \left(\frac{2}{3}\right)^k \left(\frac{1}{3}\right)^{j+k}. \]

In particular, consider the case where $k = j$. In that case we have

\[ q_j \ge \binom{3j}{j} \left(\frac{2}{3}\right)^j \left(\frac{1}{3}\right)^{2j}. \]

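In order to approximate $\binom{3j}{j}$ we use Stirling's formula, which is similar to the bounds
(5.2) and (5.5) we have previously proven for factorials. Stirling's formula is tighter,
which proves useful for this application. We use the following loose form.

Lemma 7.3 [Stirling's Formula]: For $m > 0$,

\[ \sqrt{2\pi m} \left(\frac{m}{e}\right)^m \le m! \le 2\sqrt{2\pi m} \left(\frac{m}{e}\right)^m. \]

Hence, when $j > 0$,

\begin{align*}
\binom{3j}{j} &= \frac{(3j)!}{j!\,(2j)!} \\
&\ge \frac{\sqrt{2\pi(3j)}}{4\sqrt{2\pi j}\,\sqrt{2\pi(2j)}} \left(\frac{3j}{e}\right)^{3j} \left(\frac{e}{2j}\right)^{2j} \left(\frac{e}{j}\right)^{j} \\
&= \frac{\sqrt{3}}{8\sqrt{\pi j}} \left(\frac{27}{4}\right)^j \\
&= \frac{c}{\sqrt{j}} \left(\frac{27}{4}\right)^j
\end{align*}

for a constant $c = \sqrt{3}/(8\sqrt{\pi})$. Thus, when $j > 0$,

\begin{align*}
q_j &\ge \binom{3j}{j} \left(\frac{2}{3}\right)^j \left(\frac{1}{3}\right)^{2j} \\
&\ge \frac{c}{\sqrt{j}} \left(\frac{27}{4}\right)^j \left(\frac{2}{3}\right)^j \left(\frac{1}{3}\right)^{2j} \\
&= \frac{c}{\sqrt{j}} \frac{1}{2^j}.
\end{align*}

Also, $q_0 = 1$.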

Having established a lower bound for $q_j$, we can now derive a lower bound for $q$, the
probability that the process reaches a satisfying assignment in $3n$ steps when starting
with a random assignment:

\begin{align*}
q &\ge \sum_{j=0}^{n} \Pr(\text{a random assignment has $j$ mismatches with $S$}) \cdot q_j \\
&\ge \frac{1}{2^n} + \sum_{j=1}^{n} \binom{n}{j} \left(\frac{1}{2}\right)^n \frac{c}{\sqrt{j}} \frac{1}{2^j} \\
&\ge \frac{c}{\sqrt{n}} \left(\frac{1}{2}\right)^n \sum_{j=0}^{n} \binom{n}{j} \left(\frac{1}{2}\right)^j (1)^{n-j} \\
&= \frac{c}{\sqrt{n}} \left(\frac{1}{2}\right)^n \left(\frac{3}{2}\right)^n \tag{7.3} \\
&= \frac{c}{\sqrt{n}} \left(\frac{3}{4}\right)^n,
\end{align*}

where in (7.3) we used $\sum_{j=0}^{n} \binom{n}{j} (1/2)^j (1)^{n-j} = (1 + 1/2)^n$.

Assuming that a satisfying assignment exists, the number of random assignments
the process tries before finding a satisfying assignment is a geometric random
variable with parameter $q$. The expected number of assignments tried is $1/q$, and for
each assignment the algorithm uses at most $3n$ steps. Thus, the expected number
of steps until a solution is found is bounded by $O(n^{3/2}(4/3)^n)$. As in the case of
2-SAT (Theorem 7.2), the modified 3-SAT algorithm (Algorithm 7.3) yields a Monte
Carlo algorithm for the 3-SAT problem. If the expected number of steps until a
satisfying solution is found is bounded above by $a$ and if $m$ is set to $2ab$, then the
probability that no assignment is found when the formula is satisfiable is bounded above
by $2^{-b}$.
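
To get a feel for how loose the closed-form bound on $q$ is, one can compare it with the binomial sum it was derived from. The sketch below does this in floating point; it assumes the per-start bound $q_j \ge \binom{3j}{j}(2/3)^j(1/3)^{2j}$ and the constant $c = \sqrt{3}/(8\sqrt{\pi})$ from the analysis above, and the function names are ours.

```python
from math import comb, pi, sqrt


def q_lower_sum(n):
    """Sum over j of Pr(j mismatches) * C(3j, j) (2/3)^j (1/3)^(2j)."""
    total = 0.0
    for j in range(n + 1):
        q_j = comb(3 * j, j) * (2 / 3) ** j * (1 / 3) ** (2 * j)  # equals 1 at j = 0
        total += comb(n, j) * 0.5 ** n * q_j
    return total


def q_lower_closed(n):
    """Closed-form bound (c / sqrt(n)) * (3/4)^n with c = sqrt(3) / (8 sqrt(pi))."""
    c = sqrt(3) / (8 * sqrt(pi))
    return c / sqrt(n) * 0.75 ** n


def expected_steps_bound(n):
    """At most 3n steps per restart and at most 1/q restarts on average."""
    return 3 * n / q_lower_closed(n)
```

Since `expected_steps_bound(n)` equals $(3/c)\,n^{3/2}(4/3)^n$, it grows far more slowly than $2^n$ even though it is still exponential.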


7.2. Classification of States 


A first step in analyzing the long-term behavior of a Markov chain is to classify its
states. In the case of a finite Markov chain, this is equivalent to analyzing the
connectivity structure of the directed graph representing the Markov chain.

Definition 7.2: State $i$ is accessible from state $j$ if, for some integer $n \ge 0$, $P^n_{j,i} > 0$.
If two states $i$ and $j$ are accessible from each other, we say that they communicate and
we write $i \leftrightarrow j$.


In the graph representation of a chain, $i \leftrightarrow j$ if and only if there are directed paths
connecting $i$ to $j$ and $j$ to $i$.

The communicating relation defines an equivalence relation. That is, the
communicating relation is

1. reflexive - for any state $i$, $i \leftrightarrow i$;

2. symmetric - if $i \leftrightarrow j$ then $j \leftrightarrow i$; and

3. transitive - if $i \leftrightarrow j$ and $j \leftrightarrow k$, then $i \leftrightarrow k$.

Proving this is left as Exercise 7.4. Thus, the communication relation partitions the
states into disjoint equivalence classes, which we refer to as communicating classes. It
might be possible to move from one class to another, but in that case it is impossible to
return to the first class.


Definition 7.3: A Markov chain is irreducible if all states belong to one
communicating class.


In other words, a Markov chain is irreducible if, for every pair of states, there is a
nonzero probability that the first state can reach the second. We thus have the following
lemma.


Lemma 7.4: A finite Markov chain is irreducible if and only if its graph representation
is a strongly connected graph.
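
Lemma 7.4 suggests a direct computational test for irreducibility: build the reachability sets and group states that reach each other. The sketch below is a simple quadratic-time version (a dedicated strongly-connected-components algorithm would be faster); it assumes the chain is given as a square list-of-lists transition matrix, and all function names are ours.

```python
def reachable(P, start):
    """States reachable from `start` along transitions of positive probability."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for j, p in enumerate(P[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def communicating_classes(P):
    """Partition the states: i and j share a class iff each reaches the other."""
    reach = [reachable(P, i) for i in range(len(P))]
    classes, assigned = [], set()
    for i in range(len(P)):
        if i not in assigned:
            cls = {j for j in reach[i] if i in reach[j]}
            classes.append(cls)
            assigned |= cls
    return classes


def is_irreducible(P):
    """A finite chain is irreducible iff there is one communicating class."""
    return len(communicating_classes(P)) == 1
```

For a chain with an absorbing state, `communicating_classes` returns more than one class, reflecting that the chain can leave a class but never return to it.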

Next we distinguish between transient and recurrent states. Let $r^t_{i,j}$ denote the
probability that, starting at state $i$, the first transition to state $j$ occurs at time $t$; that is,

\[ r^t_{i,j} = \Pr(X_t = j \text{ and, for } 1 \le s \le t-1,\ X_s \ne j \mid X_0 = i). \]

Definition 7.4: A state is recurrent if $\sum_{t \ge 1} r^t_{i,i} = 1$, and it is transient if $\sum_{t \ge 1} r^t_{i,i} < 1$.
A Markov chain is recurrent if every state in the chain is recurrent.

If state $i$ is recurrent then, once the chain visits that state, it will (with probability 1)
eventually return to that state. Hence the chain will visit state $i$ over and over again,
infinitely often. On the other hand, if state $i$ is transient then, starting at $i$, the chain
will return to $i$ with some fixed probability $p = \sum_{t \ge 1} r^t_{i,i}$. In this case, the number of
times the chain visits $i$ when starting at $i$ is given by a geometric random variable. If
one state in a communicating class is transient (respectively, recurrent) then all states
in that class are transient (respectively, recurrent); proving this is left as Exercise 7.5.

We denote the expected time to return to state $i$ when starting at state $i$ by
$h_{i,i} = \sum_{t \ge 1} t \cdot r^t_{i,i}$. Similarly, for any pair of states $i$ and $j$, we denote by
$h_{i,j} = \sum_{t \ge 1} t \cdot r^t_{i,j}$ the expected time to first reach $j$ from state $i$. It may seem
that if a chain is recurrent, so that we visit a state $i$ infinitely often, then $h_{i,i}$ should be
finite. This is not the case, which leads us to the following definition.

Definition 7.5: A recurrent state $i$ is positive recurrent if $h_{i,i} < \infty$. Otherwise, it is
null recurrent.


To give an example of a Markov chain that has null recurrent states, consider a chain
whose states are the positive integers. From state $i$, the probability of going to state
$i+1$ is $i/(i+1)$. With probability $1/(i+1)$, the chain returns to state 1. Starting at
state 1, the probability of not having returned to state 1 within the first $t$ steps is thus

\[ \prod_{j=1}^{t} \frac{j}{j+1} = \frac{1}{t+1}. \]

Hence the probability of never returning to state 1 from state 1 is 0, and state 1 is
recurrent. It follows that

\[ r^t_{1,1} = \frac{1}{t(t+1)}. \]

However, the expected number of steps until the first return to state 1 from state 1 is

\[ h_{1,1} = \sum_{t=1}^{\infty} t \cdot r^t_{1,1} = \sum_{t=1}^{\infty} \frac{1}{t+1}, \]

which is unbounded.
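
The contrast between the two sums is easy to see numerically. The short sketch below computes exact partial sums for the example chain; `first_return_prob` and `partial_sums` are illustrative names, and the formula used for $r^t_{1,1}$ is the one derived above.

```python
from fractions import Fraction


def first_return_prob(t):
    """r^t_{1,1}: no return in the first t-1 steps (probability 1/t), then a return."""
    return Fraction(1, t) * Fraction(1, t + 1)


def partial_sums(T):
    """Return (sum of r^t_{1,1}, sum of t * r^t_{1,1}) over t = 1..T."""
    prob = sum(first_return_prob(t) for t in range(1, T + 1))
    mean = sum(t * first_return_prob(t) for t in range(1, T + 1))
    return prob, mean
```

The first component telescopes to $1 - 1/(T+1)$ and so tends to 1, while the second is a harmonic-type sum that keeps growing (roughly like $\ln T$).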

In the foregoing example the Markov chain had an infinite number of states. This is
necessary for null recurrent states to exist. The proof of the following important lemma
is left as Exercise 7.9.


Lemma 7.5: In a finite Markov chain:

1. at least one state is recurrent; and

2. all recurrent states are positive recurrent.

Finally, for our later study of limiting distributions of Markov chains we will need to
define what it means for a state to be aperiodic. As an example of periodicity, consider
a random walk whose states are the positive integers. When at state $i$, with probability
1/2 the chain moves to $i+1$ and with probability 1/2 the chain moves to $i-1$. If
the chain starts at state 0, then it can be at an even-numbered state only after an even
number of moves, and it can be at an odd-numbered state only after an odd number of
moves. This is an example of periodic behavior.

Definition 7.6: A state $j$ in a discrete time Markov chain is periodic if there exists an
integer $\Delta > 1$ such that $\Pr(X_{t+s} = j \mid X_t = j) = 0$ unless $s$ is divisible by $\Delta$. A
discrete time Markov chain is periodic if any state in the chain is periodic. A state or
chain that is not periodic is aperiodic.

In our example, every state in the Markov chain is periodic because, for every state $j$,
$\Pr(X_{t+s} = j \mid X_t = j) = 0$ unless $s$ is divisible by 2.
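
For a finite chain, Definition 7.6 can be explored by collecting the step counts at which a return to a state has positive probability and taking their greatest common divisor. The sketch below does this up to a fixed horizon; the horizon `2 * n * n` is an assumption chosen to be generous for small examples, and the finite cycle used in the usage note is our stand-in for the walk on the integers.

```python
from math import gcd


def period(P, state, max_steps=None):
    """gcd of the step counts s <= max_steps with positive s-step return probability."""
    n = len(P)
    if max_steps is None:
        max_steps = 2 * n * n  # assumed horizon, not a proven bound
    current = {state}          # states reachable in exactly s steps
    g = 0
    for s in range(1, max_steps + 1):
        current = {j for i in current for j in range(n) if P[i][j] > 0}
        if state in current:
            g = gcd(g, s)
    return g
```

On a 4-cycle where each state moves to either neighbor with probability 1/2, `period` returns 2, mirroring the even/odd behavior in the example; adding a self-loop at a state makes its period 1.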

We end this section with an important corollary about the behavior of finite Markov
chains.

Definition 7.7: An aperiodic, positive recurrent state is an ergodic state. A Markov
chain is ergodic if all its states are ergodic.

Corollary 7.6: Any finite, irreducible, and aperiodic Markov chain is an ergodic chain.


Proof: A finite chain has at least one recurrent state by Lemma 7.5, and if the chain is
irreducible then all of its states are recurrent. In a finite chain, all recurrent states are
positive recurrent by Lemma 7.5 and thus all the states of the chain are positive
recurrent and aperiodic. The chain is therefore ergodic. ■


7.2.1. Example: The Gambler’s Ruin 

When a Markov chain has more than one class of recurrent states, we are often
interested in the probability that the process will enter and thus be absorbed by a given
communicating class.

For example, consider a sequence of independent, fair gambling games between
two players. In each round a player wins a dollar with probability 1/2 or loses a dollar
with probability 1/2. The state of the system at time $t$ is the number of dollars won by
player 1. If player 1 has lost money, this number is negative. The initial state is 0.

It is reasonable to assume that there are numbers $\ell_1$ and $\ell_2$ such that player $i$
cannot lose more than $\ell_i$ dollars, and thus the game ends when it reaches one of the two
states $-\ell_1$ or $\ell_2$. At this point, one of the gamblers is ruined; that is, he has lost all his
money. To conform with the formalization of a Markov chain, we assume that for each
of these two end states there is only one transition out and that it goes back to the same
state. This gives us a Markov chain with two absorbing, recurrent states.

What is the probability that player 1 wins $\ell_2$ dollars before losing $\ell_1$ dollars? If
$\ell_2 = \ell_1$, then by symmetry this probability must be 1/2. We provide a simple argument for
the general case using the classification of the states.

Clearly $-\ell_1$ and $\ell_2$ are recurrent states. All other states are transient, since there is a
nonzero probability of moving from each of these states to either state $-\ell_1$ or state $\ell_2$.

Let $P^t_i$ be the probability that, after $t$ steps, the chain is at state $i$. For
$-\ell_1 < i < \ell_2$, state $i$ is transient and so $\lim_{t \to \infty} P^t_i = 0$.

Let $q$ be the probability that the game ends with player 1 winning $\ell_2$ dollars, so that
the chain was absorbed into state $\ell_2$. Then $1 - q$ is the probability the chain was
absorbed into state $-\ell_1$. By definition,

\[ \lim_{t \to \infty} P^t_{\ell_2} = q. \]

Since each round of the gambling game is fair, the expected gain of player 1 in each
step is 0. Let $W^t$ be the gain of player 1 after $t$ steps. Then $E[W^t] = 0$ for any $t$ by
induction. Thus,

\[ E[W^t] = \sum_{i=-\ell_1}^{\ell_2} i P^t_i = 0 \]

and

\[ \lim_{t \to \infty} E[W^t] = \ell_2 q - \ell_1 (1 - q) = 0. \]

Thus,

\[ q = \frac{\ell_1}{\ell_1 + \ell_2}. \]

That is, the probability of winning (or losing) is proportional to the amount of money
a player is willing to lose (or win).

Another approach that yields the same answer is to let $q_j$ represent the probability
that player 1 wins $\ell_2$ dollars before losing $\ell_1$ dollars when having won $j$ dollars for
$-\ell_1 \le j \le \ell_2$. Clearly, $q_{-\ell_1} = 0$ and $q_{\ell_2} = 1$. For $-\ell_1 < j < \ell_2$, we compute by
considering the outcome of the first game:

\[ q_j = \frac{q_{j-1}}{2} + \frac{q_{j+1}}{2}. \]

We have $\ell_2 + \ell_1 - 1$ linearly independent equations and $\ell_2 + \ell_1 - 1$ unknowns (one for
each state strictly between $-\ell_1$ and $\ell_2$), so there is a unique solution to this set of
equations. It is easy to verify that $q_j = (\ell_1 + j)/(\ell_1 + \ell_2)$ satisfies the given equations.
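
The verification mentioned above is mechanical, and can be carried out exactly for any particular pair of limits. The sketch below does so with rational arithmetic; the function names and the use of a dictionary keyed by the current winnings $j$ are our own choices.

```python
from fractions import Fraction


def win_probabilities(l1, l2):
    """q_j = (l1 + j) / (l1 + l2) for j = -l1..l2, keyed by j."""
    return {j: Fraction(l1 + j, l1 + l2) for j in range(-l1, l2 + 1)}


def check_equations(l1, l2):
    """Boundary conditions plus q_j = q_{j-1}/2 + q_{j+1}/2 for interior j."""
    q = win_probabilities(l1, l2)
    interior = all(q[j] == (q[j - 1] + q[j + 1]) / 2 for j in range(-l1 + 1, l2))
    return q[-l1] == 0 and q[l2] == 1 and interior
```

The entry `win_probabilities(l1, l2)[0]` is the probability for the initial state 0, which agrees with $q = \ell_1/(\ell_1 + \ell_2)$ from the first argument.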

In Exercise 7.20, we consider the question of what happens if, as is generally the
case in real life, one player is at a disadvantage and so is slightly more likely to lose
than to win any single game.


7.3. Stationary Distributions 


Recall that if $P$ is the one-step transition probability matrix of a Markov chain and if
$\bar{p}(t)$ is the probability distribution of the state of the chain at time $t$, then

\[ \bar{p}(t+1) = \bar{p}(t) P. \]

Of particular interest are state probability distributions that do not change after a
transition.

Definition 7.8: A stationary distribution (also called an equilibrium distribution) of a
Markov chain is a probability distribution $\bar{\pi}$ such that

\[ \bar{\pi} = \bar{\pi} P. \]

If a chain ever reaches a stationary distribution then it maintains that distribution for
all future time, and thus a stationary distribution represents a steady state or an
equilibrium in the chain's behavior. Stationary distributions play a key role in analyzing
Markov chains. The fundamental theorem of Markov chains characterizes chains that
converge to stationary distributions.

We discuss first the case of finite chains and then extend the results to any discrete
space chain. Without loss of generality, assume that the finite set of states of the Markov
chain is $\{0, 1, \ldots, n\}$.


Theorem 7.7: Any finite, irreducible, and ergodic Markov chain has the following
properties:

1. the chain has a unique stationary distribution $\bar{\pi} = (\pi_0, \pi_1, \ldots, \pi_n)$;

2. for all $j$ and $i$, the limit $\lim_{t \to \infty} P^t_{j,i}$ exists and it is independent of $j$;

3. $\pi_i = \lim_{t \to \infty} P^t_{j,i} = 1/h_{i,i}$.

Under the conditions of this theorem, the stationary distribution $\bar{\pi}$ has two
interpretations. First, $\pi_i$ is the limiting probability that the Markov chain will be in state $i$
infinitely far out in the future, and this probability is independent of the initial state.
In other words, if we run the chain long enough, the initial state of the chain is almost
forgotten and the probability of being in state $i$ converges to $\pi_i$. Second, $\pi_i$ is the
inverse of $h_{i,i} = \sum_{t=1}^{\infty} t \cdot r^t_{i,i}$, the expected number of steps for a chain starting in state
$i$ to return to $i$. This stands to reason; if the average time to return to state $i$ from $i$ is
$h_{i,i}$, then we expect to be in state $i$ for $1/h_{i,i}$ of the time and thus, in the limit, we must
have $\pi_i = 1/h_{i,i}$.


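Proof of Theorem 7.7: We prove the theorem using the following result, which we
state without proof.


Lemma 7.8: For any irreducible, ergodic Markov chain and for any state $i$, the limit
$\lim_{t \to \infty} P^t_{i,i}$ exists and

\[ \lim_{t \to \infty} P^t_{i,i} = \frac{1}{h_{i,i}}. \]

This lemma is a corollary of a basic result in renewal theory. We give an informal
justification for Lemma 7.8: the expected time between visits to $i$ is $h_{i,i}$, and therefore
state $i$ is visited $1/h_{i,i}$ of the time. Thus $\lim_{t \to \infty} P^t_{i,i}$, which represents the
probability a state chosen far in the future is at state $i$ when the chain starts at state $i$, must be
$1/h_{i,i}$.

Using the fact that $\lim_{t \to \infty} P^t_{i,i}$ exists, we now show that, for any $j$ and $i$,

\[ \lim_{t \to \infty} P^t_{j,i} = \lim_{t \to \infty} P^t_{i,i} = \frac{1}{h_{i,i}}; \]

that is, these limits exist and are independent of the starting state $j$.

Recall that $r^t_{j,i}$ is the probability that starting at $j$, the chain first visits $i$ at time $t$.
Since the chain is irreducible we have that $\sum_{t=1}^{\infty} r^t_{j,i} = 1$, and for any $\varepsilon > 0$ there exists
(a finite) $t_1 = t_1(\varepsilon)$ such that $\sum_{t=1}^{t_1} r^t_{j,i} \ge 1 - \varepsilon$.

For $j \ne i$, we have

\[ P^t_{j,i} = \sum_{k=1}^{t} r^k_{j,i} P^{t-k}_{i,i}. \]

For $t \ge t_1$,

\[ \sum_{k=1}^{t_1} r^k_{j,i} P^{t-k}_{i,i} \le \sum_{k=1}^{t} r^k_{j,i} P^{t-k}_{i,i} = P^t_{j,i}. \]

Using the facts that $\lim_{t \to \infty} P^t_{i,i}$ exists and $t_1$ is finite, we have

\begin{align*}
\lim_{t \to \infty} P^t_{j,i} &\ge \lim_{t \to \infty} \sum_{k=1}^{t_1} r^k_{j,i} P^{t-k}_{i,i} \\
&= \sum_{k=1}^{t_1} r^k_{j,i} \lim_{t \to \infty} P^{t}_{i,i} \\
&= \lim_{t \to \infty} P^t_{i,i} \sum_{k=1}^{t_1} r^k_{j,i} \\
&\ge (1 - \varepsilon) \lim_{t \to \infty} P^t_{i,i}.
\end{align*}

Similarly,

\begin{align*}
P^t_{j,i} &= \sum_{k=1}^{t} r^k_{j,i} P^{t-k}_{i,i} \\
&\le \sum_{k=1}^{t_1} r^k_{j,i} P^{t-k}_{i,i} + \varepsilon,
\end{align*}

from which we can deduce that

\begin{align*}
\lim_{t \to \infty} P^t_{j,i} &\le \lim_{t \to \infty} \left( \sum_{k=1}^{t_1} r^k_{j,i} P^{t-k}_{i,i} + \varepsilon \right) \\
&= \sum_{k=1}^{t_1} r^k_{j,i} \lim_{t \to \infty} P^{t-k}_{i,i} + \varepsilon \\
&\le \lim_{t \to \infty} P^t_{i,i} + \varepsilon.
\end{align*}

Letting $\varepsilon$ approach 0, we have proven that, for any pair $i$ and $j$,

\[ \lim_{t \to \infty} P^t_{j,i} = \lim_{t \to \infty} P^t_{i,i} = \frac{1}{h_{i,i}}. \]

Now let

\[ \pi_i = \lim_{t \to \infty} P^t_{j,i} = \frac{1}{h_{i,i}}. \]

We show that $\bar{\pi} = (\pi_0, \pi_1, \ldots)$ forms a stationary distribution.

For every $t \ge 0$, we have $P^t_{j,i} \ge 0$ and thus $\pi_i \ge 0$. For any $t \ge 0$, $\sum_{i=0}^{n} P^t_{j,i} = 1$
and thus

\[ \lim_{t \to \infty} \sum_{i=0}^{n} P^t_{j,i} = \sum_{i=0}^{n} \lim_{t \to \infty} P^t_{j,i} = \sum_{i=0}^{n} \pi_i = 1, \]

and $\bar{\pi}$ is a proper distribution. Now,

\[ P^{t+1}_{j,i} = \sum_{k=0}^{n} P^t_{j,k} P_{k,i}. \]

Letting $t \to \infty$, we have

\[ \pi_i = \sum_{k=0}^{n} \pi_k P_{k,i}, \]

proving that $\bar{\pi}$ is a stationary distribution.

Suppose there were another stationary distribution $\bar{\phi}$. Then by the same argument
we would have

\[ \phi_i = \sum_{k=0}^{n} \phi_k P^t_{k,i}, \]

and taking the limit as $t \to \infty$ yields

\[ \phi_i = \sum_{k=0}^{n} \phi_k \pi_i = \pi_i \sum_{k=0}^{n} \phi_k. \]

Since $\sum_{k=0}^{n} \phi_k = 1$ it follows that $\phi_i = \pi_i$ for all $i$, or $\bar{\phi} = \bar{\pi}$. ■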

It is worth making a few remarks about Theorem 7.7. First, the requirement that the
Markov chain be aperiodic is not necessary for the existence of a stationary
distribution. In fact, any finite Markov chain has a stationary distribution; but in the case of a
periodic state $i$, the stationary probability $\pi_i$ is not the limiting probability of being in
$i$ but instead just the long-term frequency of visiting state $i$. Second, any finite chain
has at least one component that is recurrent. Once the chain reaches a recurrent
component, it cannot leave that component. Thus, the subchain that corresponds to that
component is irreducible and recurrent, and the limit theorem applies to any aperiodic
recurrent component of the chain.

One way to compute the stationary distribution of a finite Markov chain is to solve
the system of linear equations

\[ \bar{\pi} P = \bar{\pi}. \]

This is particularly useful if one is given a specific chain. For example, given the
transition matrix

\[ P = \begin{bmatrix} 0 & 1/4 & 0 & 3/4 \\ 1/2 & 0 & 1/3 & 1/6 \\ 1/4 & 1/4 & 1/2 & 0 \\ 0 & 1/2 & 1/4 & 1/4 \end{bmatrix}, \]

we have five equations for the four unknowns $\pi_0$, $\pi_1$, $\pi_2$, and $\pi_3$ given by $\bar{\pi} P = \bar{\pi}$ and
$\sum_{i=0}^{3} \pi_i = 1$. The equations have a unique solution.
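
For a specific chain such as this one, the linear system can be solved exactly by elimination. The sketch below is one straightforward way to do that, assuming we drop one (redundant) balance equation and replace it with the normalization constraint; the function name `stationary` and the Gauss-Jordan layout are ours.

```python
from fractions import Fraction as F

P = [
    [F(0), F(1, 4), F(0), F(3, 4)],
    [F(1, 2), F(0), F(1, 3), F(1, 6)],
    [F(1, 4), F(1, 4), F(1, 2), F(0)],
    [F(0), F(1, 2), F(1, 4), F(1, 4)],
]


def stationary(P):
    """Solve pi P = pi with sum(pi) = 1 by Gauss-Jordan elimination over Fractions."""
    n = len(P)
    # Rows 0..n-2: sum_i pi_i * (P[i][j] - [i == j]) = 0 for columns j < n-1.
    A = [[P[i][j] - (1 if i == j else 0) for i in range(n)] + [F(0)] for j in range(n - 1)]
    A.append([F(1)] * n + [F(1)])  # last row: normalization sum(pi) = 1
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


pi = stationary(P)
```

Because the arithmetic is exact, the result can be checked against all five original equations without any tolerance.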

Another useful technique is to study the cut-sets of the Markov chain. For any state
$i$ of the chain,

\[ \sum_{j=0}^{n} \pi_j P_{j,i} = \pi_i = \pi_i \sum_{j=0}^{n} P_{i,j} \]

or

\[ \sum_{j \ne i} \pi_j P_{j,i} = \sum_{j \ne i} \pi_i P_{i,j}. \]

That is, in the stationary distribution the probability that a chain leaves a state equals
the probability that it enters the state. This observation can be generalized to sets of
states as follows.

Theorem 7.9: Let $S$ be a set of states of a finite, irreducible, aperiodic Markov chain.
In the stationary distribution, the probability that the chain leaves the set $S$ equals the
probability that it enters $S$.

In other words, if $C$ is a cut-set in the graph representation of the chain, then in the
stationary distribution the probability of crossing the cut-set in one direction is equal to
the probability of crossing the cut-set in the other direction.

[Figure 7.2: A simple Markov chain used to represent bursty behavior.]

A basic but useful Markov chain that serves as an example of cut-sets is given in
Figure 7.2. The chain has only two states. From state 0, you move to state 1 with
probability $p$ and stay at state 0 with probability $1 - p$. Similarly, from state 1 you move
to state 0 with probability $q$ and remain in state 1 with probability $1 - q$. This Markov
chain is often used to represent bursty behavior. For example, when bits are corrupted
in transmissions they are often corrupted in large blocks, since the errors are often
caused by an external phenomenon of some duration. In this setting, being in state 0
after $t$ steps represents that the $t$th bit was sent successfully, while being in state 1
represents that the bit was corrupted. Blocks of successfully sent bits and corrupted bits
both have lengths that follow a geometric distribution. When $p$ and $q$ are small, state
changes are rare, and the bursty behavior is modeled.

The transition matrix is

\[ P = \begin{bmatrix} 1 - p & p \\ q & 1 - q \end{bmatrix}. \]

Solving $\bar{\pi} P = \bar{\pi}$ corresponds to solving the following system of three equations:

\begin{align*}
\pi_0 (1 - p) + \pi_1 q &= \pi_0; \\
\pi_0 p + \pi_1 (1 - q) &= \pi_1; \\
\pi_0 + \pi_1 &= 1.
\end{align*}

The second equation is redundant, and the solution is $\pi_0 = q/(p + q)$ and $\pi_1 =
p/(p + q)$. For example, with the natural parameters $p = 0.005$ and $q = 0.1$, in the
stationary distribution more than 95% of the bits are received uncorrupted.

Using the cut-set formulation, we have that in the stationary distribution the
probability of leaving state 0 must equal the probability of entering state 0, or

\[ \pi_0 p = \pi_1 q. \]

Again, now using $\pi_0 + \pi_1 = 1$ yields $\pi_0 = q/(p + q)$ and $\pi_1 = p/(p + q)$.
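
The numbers in this example are easy to reproduce. The following sketch evaluates the closed-form stationary distribution for the two-state chain and checks the cut-set balance; the function names are ours, and the mean block lengths rely only on the fact that a geometric random variable with parameter $r$ has mean $1/r$.

```python
def two_state_stationary(p, q):
    """Stationary distribution (pi_0, pi_1) of the chain P = [[1-p, p], [q, 1-q]]."""
    return q / (p + q), p / (p + q)


def mean_block_lengths(p, q):
    """Mean run length in state 0 (good bits) and in state 1 (corrupted bits)."""
    return 1 / p, 1 / q


p, q = 0.005, 0.1
pi0, pi1 = two_state_stationary(p, q)
# Cut-set balance: probability of leaving state 0 equals that of entering it.
balance_gap = abs(pi0 * p - pi1 * q)
```

With these parameters `pi0` is about 0.952, consistent with the claim that more than 95% of bits arrive uncorrupted, while corrupted bits come in runs of average length 10.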

Finally, for some Markov chains the stationary distribution is easy to compute by
means of the following theorem.

Theorem 7.10: Consider a finite, irreducible, and ergodic Markov chain with
transition matrix $P$. If there are nonnegative numbers $\bar{\pi} = (\pi_0, \ldots, \pi_n)$ such that
$\sum_{i=0}^{n} \pi_i = 1$ and if, for any pair of states $i, j$,

\[ \pi_i P_{i,j} = \pi_j P_{j,i}, \]

then $\bar{\pi}$ is the stationary distribution corresponding to $P$.


Proof: Consider the $j$th entry of $\bar{\pi} P$. Using the assumption of the theorem, we find
that it equals

\[ \sum_{i=0}^{n} \pi_i P_{i,j} = \sum_{i=0}^{n} \pi_j P_{j,i} = \pi_j. \]

Thus $\bar{\pi}$ satisfies $\bar{\pi} = \bar{\pi} P$. Since $\sum_{i=0}^{n} \pi_i = 1$, it follows from Theorem 7.7 that $\bar{\pi}$ must
be the unique stationary distribution of the Markov chain. ■

Chains that satisfy the condition

\[ \pi_i P_{i,j} = \pi_j P_{j,i} \]

are called time reversible. Exercise 7.13 helps explain why. You may check that the
chain of Figure 7.2 is time reversible.

We turn now to the convergence of Markov chains with countably infinite state
spaces. Using essentially the same technique as in the proof of Theorem 7.7, one can
prove the next result.


Theorem 7.11: Any irreducible aperiodic Markov chain belongs to one of the
following two categories:

1. the chain is ergodic - for any pair of states $i$ and $j$, the limit $\lim_{t \to \infty} P^t_{j,i}$ exists
and is independent of $j$, and the chain has a unique stationary distribution $\pi_i =
\lim_{t \to \infty} P^t_{j,i} > 0$; or

2. no state is positive recurrent - for all $i$ and $j$, $\lim_{t \to \infty} P^t_{j,i} = 0$, and the chain has
no stationary distribution.

Cut-sets and the property of time reversibility can also be used to find the stationary
distribution for Markov chains with countably infinite state spaces.


7.3.1. Example: A Simple Queue 

A queue is a line where customers wait for service. We examine a model for a bounded
queue where time is divided into steps of equal length. At each time step, exactly one
of the following occurs.

• If the queue has fewer than $n$ customers, then with probability $\lambda$ a new customer
joins the queue.

• If the queue is not empty, then with probability $\mu$ the head of the line is served and
leaves the queue.

• With the remaining probability, the queue is unchanged.

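If $X_t$ is the number of customers in the queue at time $t$, then under the foregoing
rules the $X_t$ yield a finite-state Markov chain. Its transition matrix has the following
nonzero entries:

    $$P_{i,i+1} = \lambda \quad \text{if } i < n;$$

    $$P_{i,i-1} = \mu \quad \text{if } i > 0;$$

    $$P_{i,i} = \begin{cases} 1 - \lambda & \text{if } i = 0, \\ 1 - \lambda - \mu & \text{if } 1 \le i \le n - 1, \\ 1 - \mu & \text{if } i = n. \end{cases}$$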

The Markov chain is irreducible, finite, and aperiodic, so it has a unique stationary
distribution $\bar{\pi}$. We use $\bar{\pi} = \bar{\pi} P$ to write

    $$\pi_0 = (1 - \lambda)\pi_0 + \mu\pi_1,$$

    $$\pi_i = \lambda\pi_{i-1} + (1 - \lambda - \mu)\pi_i + \mu\pi_{i+1}, \quad 1 \le i \le n - 1,$$

    $$\pi_n = \lambda\pi_{n-1} + (1 - \mu)\pi_n.$$


It is easy to verify that

    $$\pi_i = \pi_0 \left(\frac{\lambda}{\mu}\right)^i$$

is a solution to the preceding system of equations. Adding the requirement
$\sum_{i=0}^{n} \pi_i = 1$, we have

    $$\sum_{i=0}^{n} \pi_i = \sum_{i=0}^{n} \pi_0 \left(\frac{\lambda}{\mu}\right)^i = 1$$

or

    $$\pi_0 = \frac{1}{\sum_{i=0}^{n} (\lambda/\mu)^i}.$$

For all $0 \le i \le n$,

    $$\pi_i = \frac{(\lambda/\mu)^i}{\sum_{j=0}^{n} (\lambda/\mu)^j}. \qquad (7.4)$$

Another way to compute the stationary probability in this case is to use cut-sets. For
any $i$, the transitions $i \to i + 1$ and $i + 1 \to i$ constitute a cut-set of the graph
representing the Markov chain. Thus, in the stationary distribution, the probability of
moving from state $i$ to state $i + 1$ must be equal to the probability of moving from
state $i + 1$ to $i$, or

    $$\lambda\pi_i = \mu\pi_{i+1}.$$

A simple induction now yields

    $$\pi_i = \pi_0 \left(\frac{\lambda}{\mu}\right)^i.$$
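
The bounded-queue calculation can be checked numerically. The sketch below is illustrative only: the function names and the parameter values (n = 4, arrival probability 1/5, service probability 3/10) are assumptions chosen for the example. It builds the transition matrix from the three rules of the model, forms the distribution in which each stationary probability is proportional to a power of the arrival-to-service ratio, and verifies both the stationary equations and the cut-set identity exactly.

```python
from fractions import Fraction

def queue_transition_matrix(n, lam, mu):
    """Transition matrix of the bounded queue on states 0..n."""
    P = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        if i < n:
            P[i][i + 1] = lam          # a new customer joins
        if i > 0:
            P[i][i - 1] = mu           # the head of the line is served
        P[i][i] = 1 - sum(P[i])        # otherwise the queue is unchanged
    return P

def queue_stationary(n, lam, mu):
    """Stationary distribution with pi_i proportional to (lam/mu)^i."""
    weights = [(lam / mu) ** i for i in range(n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Assumed example parameters.
n, lam, mu = 4, Fraction(1, 5), Fraction(3, 10)
P = queue_transition_matrix(n, lam, mu)
pi = queue_stationary(n, lam, mu)

stationary_ok = all(
    sum(pi[i] * P[i][j] for i in range(n + 1)) == pi[j] for j in range(n + 1))
cut_set_ok = all(lam * pi[i] == mu * pi[i + 1] for i in range(n))
print(stationary_ok, cut_set_ok)
```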



In the case where there is no upper limit $n$ on the number of customers in a queue,
the Markov chain is no longer finite. The Markov chain has a countably infinite state
space. Applying Theorem 7.11, the Markov chain has a stationary distribution if and
only if the following set of linear equations has a solution with all $\pi_i > 0$:

    $$\pi_0 = (1 - \lambda)\pi_0 + \mu\pi_1;$$

    $$\pi_i = \lambda\pi_{i-1} + (1 - \lambda - \mu)\pi_i + \mu\pi_{i+1}, \quad i \ge 1. \qquad (7.5)$$


It is easy to verify that

    $$\pi_i = \frac{(\lambda/\mu)^i}{\sum_{j=0}^{\infty} (\lambda/\mu)^j} = \left(\frac{\lambda}{\mu}\right)^i \left(1 - \frac{\lambda}{\mu}\right)$$

is a solution of the system of equations (7.5). This naturally generalizes the solution to
the case where there is an upper bound $n$ on the number of the customers in the system
given in Eqn. (7.4). All of the $\pi_i$ are greater than 0 if and only if $\lambda < \mu$, which
corresponds to the situation when the rate at which customers arrive is lower than the rate
at which they are served. If $\lambda > \mu$, then the rate at which customers arrive is higher
than the rate at which they depart. Hence there is no stationary distribution, and the
queue length will become arbitrarily long. In this case, each state in the Markov chain
is transient. The case of $\lambda = \mu$ is more subtle. Again, there is no stationary
distribution and the queue length will become arbitrarily long, but now the states are null
recurrent. (See the related Exercise 7.17.)
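
For readers who want the "easy to verify" step spelled out, the following worked check substitutes a geometric solution into the equations (7.5). The shorthand $\rho = \lambda/\mu$ and $c = 1 - \rho$ is introduced here only for this check and is not notation from the text; the last line assumes $0 < \rho < 1$.

```latex
% Substitute \pi_i = c \rho^i with \rho = \lambda/\mu and c = 1 - \rho into (7.5).
\begin{align*}
(1-\lambda)\pi_0 + \mu\pi_1
  &= c\,(1 - \lambda + \mu\rho) = c\,(1 - \lambda + \lambda) = c = \pi_0, \\
\lambda\pi_{i-1} + (1-\lambda-\mu)\pi_i + \mu\pi_{i+1}
  &= c\rho^{i}\left(\frac{\lambda}{\rho} + 1 - \lambda - \mu + \mu\rho\right) \\
  &= c\rho^{i}\,(\mu + 1 - \lambda - \mu + \lambda) = c\rho^{i} = \pi_i,
     \qquad i \ge 1, \\
\sum_{i=0}^{\infty} \pi_i
  &= (1-\rho)\sum_{i=0}^{\infty}\rho^{i} = 1
     \quad \text{when } 0 < \rho < 1 .
\end{align*}
```

When $\rho \ge 1$ the geometric series diverges, so no normalization is possible, which matches the case analysis above.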


7.4. Random Walks on Undirected Graphs 

A random walk on an undirected graph is a special type of Markov chain that is often
used in analyzing algorithms. Let G = (V, E) be a finite, undirected, and connected
graph.

Definition 7.9: A random walk on G is a Markov chain defined by the sequence of
moves of a particle between vertices of G. In this process, the place of the particle at
a given time step is the state of the system. If the particle is at vertex i and if i has d(i)
outgoing edges, then the probability that the particle follows the edge (i, j) and moves
to a neighbor j is 1/d(i).

We have already seen an example of such a walk when we analyzed the randomized
2-SAT algorithm.

For a random walk on an undirected graph, we have a simple criterion for
aperiodicity as follows.

Lemma 7.12: A random walk on an undirected graph G is aperiodic if and only if G 
is not bipartite. 

Proof: A graph is bipartite if and only if it does not have cycles with an odd number 
of edges. In an undirected graph, there is always a path of length 2 from a vertex to 
itself. If the graph is bipartite then the random walk is periodic with period d = 2. 
If the graph is not bipartite then it has an odd cycle, and by traversing that cycle we 
have an odd-length path from any vertex to itself. It follows that the Markov chain is 
aperiodic. ■ 
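
Lemma 7.12 reduces aperiodicity of the walk to a purely graph-theoretic test. A minimal sketch of that test follows, assuming the graph is connected and given as a dictionary mapping each vertex to a list of its neighbors (this representation and the function names are assumptions made for the example). It attempts a 2-coloring by breadth-first search; the attempt fails exactly when an odd cycle exists.

```python
from collections import deque

def is_bipartite(adj):
    """2-color a connected undirected graph by BFS; adj maps vertex -> neighbors."""
    start = next(iter(adj))
    color = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in color:
                color[w] = 1 - color[u]
                queue.append(w)
            elif color[w] == color[u]:
                return False           # an odd cycle was found
    return True

def walk_is_aperiodic(adj):
    """By Lemma 7.12, the random walk is aperiodic iff the graph is not bipartite."""
    return not is_bipartite(adj)

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(walk_is_aperiodic(triangle), walk_is_aperiodic(square))
```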

For the remainder of this section we assume that G is not bipartite. A random walk 
on a finite, undirected, connected, and non-bipartite graph G satisfies the conditions 
of Theorem 7.7, and hence the random walk converges to a stationary distribution. We 
show that this distribution depends only on the degree sequence of the graph. 

Theorem 7.13: A random walk on G converges to a stationary distribution $\bar{\pi}$, where

    $$\pi_v = \frac{d(v)}{2|E|}.$$

Proof: Since $\sum_{v \in V} d(v) = 2|E|$, it follows that

    $$\sum_{v \in V} \pi_v = \sum_{v \in V} \frac{d(v)}{2|E|} = 1,$$

and $\bar{\pi}$ is a proper distribution over $v \in V$.

Let P be the transition probability matrix of the Markov chain. Let N(v) represent
the neighbors of v. The relation $\bar{\pi} = \bar{\pi} P$ is equivalent to

    $$\pi_v = \sum_{u \in N(v)} \frac{d(u)}{2|E|} \cdot \frac{1}{d(u)} = \frac{d(v)}{2|E|},$$

and the theorem follows. ■
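
The claim of Theorem 7.13 can be checked on a small example. The sketch below is illustrative: the graph (a triangle with one pendant vertex) and the function names are assumptions. It computes the degree-proportional distribution and confirms that one step of the walk leaves it unchanged.

```python
from fractions import Fraction

def degree_distribution(adj):
    """Return {v: d(v) / (2|E|)} for a graph given as vertex -> list of neighbors."""
    two_m = sum(len(nbrs) for nbrs in adj.values())   # sum of degrees equals 2|E|
    return {v: Fraction(len(nbrs), two_m) for v, nbrs in adj.items()}

def step(adj, dist):
    """Apply one step of the random walk to a distribution over the vertices."""
    new = {v: Fraction(0) for v in adj}
    for u, nbrs in adj.items():
        for v in nbrs:
            new[v] += dist[u] / len(nbrs)   # u moves to each neighbor w.p. 1/d(u)
    return new

# Assumed example graph: triangle 0-1-2 with pendant vertex 3 attached to vertex 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
pi = degree_distribution(adj)
print(step(adj, pi) == pi)
```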


Recall that $h_{v,u}$ denotes the expected number of steps to reach u from v. We have the
following corollary.

Corollary 7.14: For any vertex u in G,

    $$h_{u,u} = \frac{2|E|}{d(u)}.$$




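For any pair of vertices u and v, we prove the following simple bound.

Lemma 7.15: If $(u, v) \in E$, then $h_{v,u} < 2|E|$.

Proof: Let N(u) be the set of neighbors of vertex u in G. We compute $h_{u,u}$ in two
different ways:

    $$\frac{2|E|}{d(u)} = h_{u,u} = \frac{1}{d(u)} \sum_{w \in N(u)} (1 + h_{w,u}).$$

Therefore,

    $$2|E| = \sum_{w \in N(u)} (1 + h_{w,u}),$$

and, because $v \in N(u)$ and every term of the sum is positive, we conclude that
$h_{v,u} < 2|E|$. ■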
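
Corollary 7.14 and Lemma 7.15 can both be confirmed exactly on a small graph by solving the linear equations that define the hitting times. The sketch below is an illustration under stated assumptions: the example graph, the function names, and the use of a hand-written Gauss-Jordan solver over exact fractions are all choices made here, not part of the text.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, nonsingular)."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def hitting_times_to(adj, u):
    """Return {v: h_{v,u}} for v != u: expected steps for the walk to reach u from v."""
    others = [v for v in adj if v != u]
    index = {v: k for k, v in enumerate(others)}
    A = [[Fraction(0)] * len(others) for _ in others]
    b = [Fraction(1)] * len(others)
    for v in others:
        # h_v = 1 + (1/d(v)) * sum of h_w over neighbors w, with h_u = 0.
        A[index[v]][index[v]] += 1
        for w in adj[v]:
            if w != u:
                A[index[v]][index[w]] -= Fraction(1, len(adj[v]))
    return dict(zip(others, solve(A, b)))

def return_time(adj, u):
    """h_{u,u}: one step to a random neighbor, then the hitting time back to u."""
    h = hitting_times_to(adj, u)
    return 1 + sum(h[w] for w in adj[u]) / len(adj[u])

# Assumed example graph: triangle 0-1-2 with pendant vertex 3 attached to vertex 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
two_m = sum(len(nbrs) for nbrs in adj.values())       # 2|E|
print(all(return_time(adj, u) == Fraction(two_m, len(adj[u])) for u in adj))
print(all(hitting_times_to(adj, u)[v] < two_m for u in adj for v in adj[u]))
```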

Definition 7.10: The cover time of a graph G = (V, E) is the maximum over all
vertices $v \in V$ of the expected time to visit all of the nodes in the graph by a random walk
starting from v.

Lemma 7.16: The cover time of G = (V, E) is bounded above by $4|V| \cdot |E|$.


Proof: Choose a spanning tree of G; that is, choose any subset of the edges that gives 
an acyclic graph connecting all of the vertices of G. There exists a cyclic (Eulerian) 
tour on this spanning tree in which every edge is traversed once in each direction: 
for example, such a tour can be found by considering the sequence of vertices passed 
through when doing a depth-first search. Let $v_0, v_1, \ldots, v_{2|V|-2} = v_0$ be the sequence
of vertices in the tour, starting from vertex $v_0$. Clearly the expected time to go through
the vertices in the tour is an upper bound on the cover time. Hence the cover time is
bounded above by

    $$\sum_{i=0}^{2|V|-3} h_{v_i, v_{i+1}} < (2|V| - 2)(2|E|) < 4|V| \cdot |E|,$$

where the first inequality comes from Lemma 7.15. ■


7.4.1. Application: An s-t Connectivity Algorithm 

Suppose we are given an undirected graph G = (V, E) and two vertices s and t in G.
Let n = |V| and m = |E|. We want to determine if there is a path connecting s and
t. This is easily done in linear time using a standard breadth-first search or depth-first
search. Such algorithms, however, require $\Omega(n)$ space.

Here we develop a randomized algorithm that works with only $O(\log n)$ bits of
memory. This could be even less than the number of bits required to write the path between
s and t. The algorithm is simple: perform a random walk on G for enough steps so that
a path from s to t is likely to be found. We use the cover time result (Lemma 7.16) to
bound the number of steps that the random walk has to run. For convenience, assume
that the graph G has no bipartite connected components, so that the results of
Theorem 7.13 apply to any connected component of G. (The results can be made to apply
to bipartite graphs with some additional technical work.)

s-t Connectivity Algorithm:

1. Start a random walk from s.

2. If the walk reaches t within $4n^3$ steps, return that there is a path. Otherwise,
return that there is no path.

Algorithm 7.4: s-t Connectivity algorithm.

Theorem 7.17: The s-t connectivity algorithm (Algorithm 7.4) returns the correct
answer with probability at least 1/2, and it only errs by returning that there is no path
from s to t when there is such a path.


Proof: If there is no path then the algorithm returns the correct answer. If there is a
path, the algorithm errs if it does not find the path within $4n^3$ steps of the walk. The
expected time to reach t from s (if there is a path) is bounded from above by the cover
time of their shared component, which by Lemma 7.16 is at most $4nm < 2n^3$. By
Markov’s inequality, the probability that a walk takes more than $4n^3$ steps to reach t
from s is at most 1/2. ■

The algorithm must keep track of its current position, which takes $O(\log n)$ bits, as
well as the number of steps taken in the random walk, which also takes only $O(\log n)$
bits (since we count up to only $4n^3$). As long as there is some mechanism for choosing
a random neighbor from each vertex, this is all the memory required.
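
Algorithm 7.4 translates almost directly into code. The sketch below is a minimal illustration, with several assumptions: the graph is a dictionary mapping each vertex to a list of neighbors, the function name and the optional `rng` parameter are hypothetical, and an isolated start vertex is treated as "no path" because the walk cannot move. Note that this Python version stores the whole adjacency structure, so it does not itself run in logarithmic space; the $O(\log n)$ bound refers only to the working state (`current` and the step counter), with the graph treated as read-only input.

```python
import random

def st_connected(adj, s, t, rng=random):
    """Random-walk s-t connectivity: walk from s for at most 4 n^3 steps."""
    n = len(adj)
    current = s                        # O(log n) bits of working state
    for _ in range(4 * n ** 3):        # step counter: also O(log n) bits
        if current == t:
            return True
        if not adj[current]:
            return False               # isolated vertex: the walk cannot move
        current = rng.choice(adj[current])
    return current == t

# Assumed example: two components, {0, 1} and {2, 3}.
adj = {0: [1], 1: [0], 2: [3], 3: [2]}
print(st_connected(adj, 0, 1), st_connected(adj, 0, 3))
```

A `True` answer is always correct, since the walk only follows edges; only a `False` answer can be wrong, matching the one-sided error in Theorem 7.17.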


7.5. Parrondo’s Paradox

Parrondo’s paradox provides an interesting example of the analysis of Markov chains
while also demonstrating a subtlety in dealing with probabilities. The paradox appears
to contradict the old saying that two wrongs don’t make a right, showing that two
losing games can be combined to make a winning game. Because Parrondo’s paradox can
be analyzed in many different ways, we will go over several approaches to the problem.

First, consider game A, in which we repeatedly flip a biased coin (call it coin a) that
comes up heads with probability $p_a < 1/2$ and tails with probability $1 - p_a$. You win a
dollar if the coin comes up heads and lose a dollar if it comes up tails. Clearly, this is a
losing game for you. For example, if $p_a = 0.49$, then your expected loss is two cents
per game.

In game B, we also repeatedly flip coins, but the coin that is flipped depends on how
you have been doing so far in the game. Let $w$ be the number of your wins so far and
$\ell$ the number of your losses. Each round we bet one dollar, so $w - \ell$ represents your
winnings; if it is negative, you have lost money. Game B uses two biased coins, coin b
and coin c. If your winnings in dollars are a multiple of 3, then you flip coin b, which
comes up heads with probability $p_b$ and tails with probability $1 - p_b$. Otherwise, you
flip coin c, which comes up heads with probability $p_c$ and tails with probability $1 - p_c$.
Again, you win a dollar if the coin comes up heads and lose a dollar if it comes up tails.

This game is more complicated, so let us consider a specific example. Suppose coin
b comes up heads with probability $p_b = 0.09$ and tails with probability 0.91 and that
coin c comes up heads with probability $p_c = 0.74$ and tails with probability 0.26. At
first glance, it might seem that game B is in your favor. If we use coin b for the 1/3
of the time that your winnings are a multiple of 3 and use coin c the other 2/3 of the
time, then your probability $w$ of winning is

    $$w = \frac{1}{3} \cdot \frac{9}{100} + \frac{2}{3} \cdot \frac{74}{100} = \frac{157}{300} > \frac{1}{2}.$$

The problem with this line of reasoning is that coin b is not necessarily used 1/3 of 
the time! To see this intuitively, consider what happens when you first start the game, 
when your winnings are 0. You use coin b and most likely lose, after which you use 
coin c and most likely win. You may spend a great deal of time going back and forth 
between having lost one dollar and breaking even before either winning one dollar or 
losing two dollars, so you may use coin b more than 1/3 of the time. 

In fact, the specific example for game B is a losing game for you. One way to show 
this is to suppose that we start playing game B when your winnings are 0, continuing 
until you either lose three dollars or win three dollars. If you are more likely to lose 
than win in this case, by symmetry you are more likely to lose three dollars than win 
three dollars whenever your winnings are a multiple of 3. On average, then, you would 
obviously lose money on the game. 

One way to determine if you are more likely to lose than win is to analyze the
absorbing states. Consider the Markov chain on the state space consisting of the integers
$\{-3, \ldots, 3\}$, where the states represent your winnings. We want to know, when you
start at 0, whether or not you are more likely to reach $-3$ before reaching 3. We can
determine this by setting up a system of equations. Let $z_i$ represent the probability you
will end up having lost three dollars before having won three dollars when your
current winnings are $i$ dollars. We calculate all the probabilities $z_{-3}, z_{-2}, z_{-1}, z_0, z_1, z_2$,
and $z_3$, although what we are really interested in is $z_0$. If $z_0 > 1/2$, then we are more
likely to lose three dollars than win three dollars starting from 0. Here $z_{-3} = 1$ and
$z_3 = 0$; these are boundary conditions. We also have the following equations:

    $$z_{-2} = (1 - p_c) z_{-3} + p_c z_{-1},$$

    $$z_{-1} = (1 - p_c) z_{-2} + p_c z_0,$$

    $$z_0 = (1 - p_b) z_{-1} + p_b z_1,$$

    $$z_1 = (1 - p_c) z_0 + p_c z_2,$$

    $$z_2 = (1 - p_c) z_1 + p_c z_3.$$

This is a system of five equations with five unknowns, and hence it can be solved easily.
The general solution for $z_0$ is

    $$z_0 = \frac{(1 - p_b)(1 - p_c)^2}{(1 - p_b)(1 - p_c)^2 + p_b p_c^2}.$$

For the specific example here, the solution yields $z_0 = 15{,}379/27{,}700 \approx 0.555$,
showing that one is much more likely to lose than win playing this game over the long run.
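
The five equations can also be solved mechanically. The following sketch is one possible approach, not the book's: it writes every $z_i$ as a linear function of the single unknown $z_{-2}$, propagates the recurrence upward from the boundary $z_{-3} = 1$, and then fixes the unknown using the other boundary condition $z_3 = 0$. The function name is hypothetical, and exact fractions are used so the result can be compared with the closed form.

```python
from fractions import Fraction

def lose_first_probabilities(p_b, p_c):
    """Return {i: z_i} for i in -3..3, where z_i = Pr(reach -3 before 3 | winnings i)."""
    def p_win(i):
        # Coin b is used when the winnings are a multiple of 3, coin c otherwise.
        return p_b if i % 3 == 0 else p_c

    # Represent z_i = a[i] + b[i] * x, where x = z_{-2} is unknown and z_{-3} = 1.
    a = {-3: Fraction(1), -2: Fraction(0)}
    b = {-3: Fraction(0), -2: Fraction(1)}
    for i in range(-2, 3):
        p = p_win(i)
        # z_i = (1 - p) z_{i-1} + p z_{i+1}  =>  z_{i+1} = (z_i - (1 - p) z_{i-1}) / p
        a[i + 1] = (a[i] - (1 - p) * a[i - 1]) / p
        b[i + 1] = (b[i] - (1 - p) * b[i - 1]) / p
    x = -a[3] / b[3]                   # impose the boundary condition z_3 = 0
    return {i: a[i] + b[i] * x for i in range(-3, 4)}

p_b, p_c = Fraction(9, 100), Fraction(74, 100)
z = lose_first_probabilities(p_b, p_c)
closed_form = ((1 - p_b) * (1 - p_c) ** 2
               / ((1 - p_b) * (1 - p_c) ** 2 + p_b * p_c ** 2))
print(z[0] == closed_form, float(z[0]))
```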

Instead of solving these equations directly, there is a simpler way of determining
the relative probability of reaching $-3$ or 3 first. Consider any sequence of moves that
starts at 0 and ends at 3 before reaching $-3$. For example, a possible sequence is

    $$s = 0, 1, 2, 1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1, 2, 3.$$

We create a one-to-one and onto mapping $f$ of such sequences with the sequences that
start at 0 and end at $-3$ before reaching 3 by negating every number starting from the
last 0 in the sequence. In this example, $s$ maps to $f(s)$, where

    $$f(s) = 0, 1, 2, 1, 2, 1, 0, -1, -2, -1, 0, -1, -2, -1, -2, -3.$$

It is simple to check that this is a one-to-one mapping of the relevant sequences. 
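
The mapping is short enough to state as code. In this sketch (the function name is an assumption made for illustration) a sequence of winnings is a Python list, and the map negates every entry after the last 0. Applying the map twice returns the original sequence, which is one way to see that it is one-to-one.

```python
def reflect_after_last_zero(seq):
    """Negate every entry after the last 0 in a sequence of winnings."""
    last_zero = len(seq) - 1 - seq[::-1].index(0)
    return seq[:last_zero + 1] + [-x for x in seq[last_zero + 1:]]

s = [0, 1, 2, 1, 2, 1, 0, -1, -2, -1, 0, 1, 2, 1, 2, 3]
f_s = reflect_after_last_zero(s)
print(f_s)
print(reflect_after_last_zero(f_s) == s)
```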

The following lemma provides a useful relationship between $s$ and $f(s)$.

Lemma 7.18: For any sequence $s$ of moves that starts at 0 and ends at 3 before
reaching $-3$, we have

    $$\frac{\Pr(s \text{ occurs})}{\Pr(f(s) \text{ occurs})} = \frac{p_b p_c^2}{(1 - p_b)(1 - p_c)^2}.$$

Proof: For any given sequence $s$ satisfying the properties of the lemma, let $t_1$ be the
number of transitions from 0 to 1; $t_2$, the number of transitions from 0 to $-1$; $t_3$, the
sum of the number of transitions from $-2$ to $-1$, $-1$ to 0, 1 to 2, and 2 to 3; and $t_4$,
the sum of the number of transitions from 2 to 1, 1 to 0, $-1$ to $-2$, and $-2$ to $-3$. Then
the probability that the sequence $s$ occurs is $p_b^{t_1} (1 - p_b)^{t_2} p_c^{t_3} (1 - p_c)^{t_4}$.

Now consider what happens when we transform $s$ to $f(s)$. We change one
transition from 0 to 1 into a transition from 0 to $-1$. After this point, in $s$ the total number of
transitions that move up 1 is two more than the number of transitions that move down
1, since the sequence ends at 3. In $f(s)$, then, the total number of transitions that move
down 1 is two more than the number of transitions that move up 1. It follows that the
probability that the sequence $f(s)$ occurs is $p_b^{t_1 - 1} (1 - p_b)^{t_2 + 1} p_c^{t_3 - 2} (1 - p_c)^{t_4 + 2}$. The
lemma follows. ■

By letting $S$ be the set of all sequences of moves that start at 0 and end at 3 before
reaching $-3$, it immediately follows that

    $$\frac{\Pr(3 \text{ is reached before } -3)}{\Pr(-3 \text{ is reached before } 3)} = \frac{\sum_{s \in S} \Pr(s \text{ occurs})}{\sum_{s \in S} \Pr(f(s) \text{ occurs})} = \frac{p_b p_c^2}{(1 - p_b)(1 - p_c)^2}.$$

If this ratio is less than 1, then you are more likely to lose than win. In our specific
example, this ratio is 12,321/15,379 < 1.

Finally, yet another way to analyze the problem is to use the stationary
distribution. Consider the Markov chain on the states {0, 1, 2}, where here the states represent
the remainder when our winnings are divided by 3. (That is, the state keeps track of
$(w - \ell) \bmod 3$.) Let $\pi_i$ be the stationary probability of this chain. The probability that
we win a dollar in the stationary distribution, which is the limiting probability that we
win a dollar if we play long enough, is then

    $$p_b \pi_0 + p_c \pi_1 + p_c \pi_2 = p_b \pi_0 + p_c (1 - \pi_0) = p_c - (p_c - p_b)\pi_0.$$


Again, we want to know if this is greater than or less than 1/2.
The equations for the stationary distribution are easy to write:

    $$\pi_0 + \pi_1 + \pi_2 = 1,$$

    $$p_b \pi_0 + (1 - p_c)\pi_2 = \pi_1,$$

    $$p_c \pi_1 + (1 - p_b)\pi_0 = \pi_2,$$

    $$p_c \pi_2 + (1 - p_c)\pi_1 = \pi_0.$$


Indeed, since there are four equations and only three unknowns, one of these equations
is actually redundant. The system is easily solved to find

    $$\pi_0 = \frac{1 - p_c + p_c^2}{3 - 2p_c - p_b + 2p_b p_c + p_c^2},$$

    $$\pi_1 = \frac{p_b p_c - p_c + 1}{3 - 2p_c - p_b + 2p_b p_c + p_c^2},$$

    $$\pi_2 = \frac{p_b p_c - p_b + 1}{3 - 2p_c - p_b + 2p_b p_c + p_c^2}.$$

Recall that you lose if the probability of winning in the stationary distribution is less
than 1/2 or, equivalently, if $p_c - (p_c - p_b)\pi_0 < 1/2$. In our specific example,
$\pi_0 = 673/1759 \approx 0.3826$, and

    $$p_c - (p_c - p_b)\pi_0 = \frac{86{,}421}{175{,}900} < \frac{1}{2}.$$

Again, we find that game B is a losing game in the long run.
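
The stationary-distribution analysis of game B can be reproduced with exact arithmetic. This sketch is illustrative (the function names are assumptions): it evaluates the closed-form stationary probabilities for the chain on winnings modulo 3, checks them against the balance equations, and computes the long-run probability of winning a round for the example coins.

```python
from fractions import Fraction

def stationary_mod3(p_b, p_c):
    """Stationary distribution (pi_0, pi_1, pi_2) of the winnings-mod-3 chain."""
    denom = 3 - 2 * p_c - p_b + 2 * p_b * p_c + p_c ** 2
    return ((1 - p_c + p_c ** 2) / denom,
            (p_b * p_c - p_c + 1) / denom,
            (p_b * p_c - p_b + 1) / denom)

def win_probability(p_b, p_c):
    """Long-run probability of winning a round: p_c - (p_c - p_b) * pi_0."""
    pi_0 = stationary_mod3(p_b, p_c)[0]
    return p_c - (p_c - p_b) * pi_0

p_b, p_c = Fraction(9, 100), Fraction(74, 100)
pi_0, pi_1, pi_2 = stationary_mod3(p_b, p_c)

# Balance equations for the chain on {0, 1, 2}.
balanced = (pi_0 + pi_1 + pi_2 == 1
            and p_b * pi_0 + (1 - p_c) * pi_2 == pi_1
            and p_c * pi_1 + (1 - p_b) * pi_0 == pi_2
            and p_c * pi_2 + (1 - p_c) * pi_1 == pi_0)
print(balanced, float(pi_0), float(win_probability(p_b, p_c)))
```

Since $\pi_0$ exceeds 1/3, coin b is used more often than the naive "1/3 of the time" estimate assumed, which is why the naive calculation overstates the chance of winning.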

We have now completely analyzed game A and game B. Next let us consider what
happens when we try to combine these two games. In game C, we repeatedly perform
the following bet. We start by flipping a fair coin, call it coin d. If coin d is heads, we
proceed as in game A: we flip coin a, and if the coin is heads, you win. If coin d is
tails, we then proceed to game B: if your current winnings are a multiple of 3, we flip
coin b; otherwise, we flip coin c, and if the coin is heads then you win. It would seem
that this must be a losing game for you. After all, game A and game B are both losing
games, and this game just flips a coin to decide which of the two games to play.

In fact, game C is exactly like game B, except the probabilities are slightly
different. If your winnings are a multiple of 3, then the probability that you win is
$p_b^* = \frac{1}{2} p_a + \frac{1}{2} p_b$. Otherwise, the probability that you win is $p_c^* = \frac{1}{2} p_a + \frac{1}{2} p_c$. Using $p_b^*$
and $p_c^*$ in place of $p_b$ and $p_c$, we can repeat any of the foregoing analyses we used for
game B.

For example, if the ratio

    $$\frac{p_b^* (p_c^*)^2}{(1 - p_b^*)(1 - p_c^*)^2} < 1,$$

then the game is a losing game for you; if the ratio is larger than 1, it is a winning game.
In our specific example the ratio is 438,741/420,959 > 1, so game C appears to be a
winning game.

This seems somewhat odd, so let us recheck by using our other approach of
considering the stationary distribution. The game is a losing game if
$p_c^* - (p_c^* - p_b^*)\pi_0 < 1/2$ and a winning game if $p_c^* - (p_c^* - p_b^*)\pi_0 > 1/2$, where $\pi_0$ is
now the stationary probability of state 0 for the chain corresponding to game C. In our
specific example, $\pi_0 = 30{,}529/88{,}597$, and

    $$p_c^* - (p_c^* - p_b^*)\pi_0 = \frac{4{,}456{,}523}{8{,}859{,}700} > \frac{1}{2},$$

so game C again appears to be a winning game.
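
Both checks for game C can be repeated side by side with game B. The sketch below is an illustration with hypothetical names: `analyze` bundles the win/lose ratio, the stationary probability of state 0, and the long-run probability of winning a round, and is applied first to the original coins and then to the averaged coins of game C, using the example value $p_a = 0.49$.

```python
from fractions import Fraction

def analyze(p_b, p_c):
    """Return (win/lose ratio, pi_0, long-run win probability) for a game-B-style chain."""
    ratio = (p_b * p_c ** 2) / ((1 - p_b) * (1 - p_c) ** 2)
    pi_0 = (1 - p_c + p_c ** 2) / (3 - 2 * p_c - p_b + 2 * p_b * p_c + p_c ** 2)
    return ratio, pi_0, p_c - (p_c - p_b) * pi_0

p_a, p_b, p_c = Fraction(49, 100), Fraction(9, 100), Fraction(74, 100)
p_b_star = (p_a + p_b) / 2             # game C, winnings a multiple of 3
p_c_star = (p_a + p_c) / 2             # game C, otherwise

game_b = analyze(p_b, p_c)
game_c = analyze(p_b_star, p_c_star)
print([float(x) for x in game_b])
print([float(x) for x in game_c])
```

In this example the ratio moves from below 1 to above 1 and the long-run win probability moves from below 1/2 to above 1/2, even though each component game is losing on its own.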

How can randomly combining two losing games yield a winning game? The key 
is that game B was a losing game because it had a very specific structure. You were 
likely to lose the next round in game B if your winnings were divisible by 3, but if
you managed to get over that initial barrier you were likely to win the next few games
as well. The strength of that barrier made game B a losing game. By combining the 
games that barrier was weakened, because now when your winnings are divisible by 3 
you sometimes get to play game A, which is close to a fair game. Although game A 
is biased against you, the bias is small, so it becomes easier to overcome that initial 
barrier. The combined game no longer has the specific structure required to make it a 
losing game. 

You may be concerned that this seems to violate the law of linearity of expectations.
If the winnings from a round of game A, B, and C are $X_A$, $X_B$, and $X_C$ (respectively),
then it seems that

    $$E[X_C] = E\left[\tfrac{1}{2} X_A + \tfrac{1}{2} X_B\right] = \tfrac{1}{2} E[X_A] + \tfrac{1}{2} E[X_B],$$

so if $E[X_A]$ and $E[X_B]$ are negative then $E[X_C]$ should also be negative. The problem
is that this equation does not make sense, because we cannot talk about the expected
winnings of a round of games B and C without reference to the current winnings. We
have described a Markov chain on the states {0, 1, 2} for games B and C. Let $s$
represent the current state. We have

    $$E[X_C \mid s] = E\left[\tfrac{1}{2}(X_A + X_B) \mid s\right] = \tfrac{1}{2} E[X_A \mid s] + \tfrac{1}{2} E[X_B \mid s].$$

Linearity of expectations holds for any given step, but we must condition on the
current state. By combining the games we have changed how much time the chain spends in
each state, allowing the two losing games to become a winning game.


7.6. Exercises

so $P_{0,3} = 3/5$ is the probability of moving from state 0 to state 3.

(a) Find the stationary distribution of the Markov chain.

(b) Find the probability of being in state 3 after 32 steps if the chain begins at state 0.

(c) Find the probability of being in state 3 after 128 steps if the chain begins at a state
chosen uniformly at random from the four states.

(d) Suppose that the chain begins in state 0. What is the smallest value of $t$ for which
$\max_s |P^t_{0,s} - \pi_s| \le 0.01$? Here $\bar{\pi}$ is the stationary distribution. What is the smallest
value of $t$ for which $\max_s |P^t_{0,s} - \pi_s| \le 0.001$?

Exercise 7.2: Consider the two-state Markov chain with the following transition matrix.

    $$P = \begin{bmatrix} p & 1 - p \\ 1 - p & p \end{bmatrix}.$$

Find a simple expression for $P^t_{0,0}$.

Exercise 7.3: Consider a process $X_0, X_1, X_2, \ldots$ with two states, 0 and 1. The process
is governed by two matrices, $P$ and $Q$. If $k$ is even, the values $P_{i,j}$ give the
probability of going from state $i$ to state $j$ on the step from $X_k$ to $X_{k+1}$. Likewise, if $k$ is
odd then the values $Q_{i,j}$ give the probability of going from state $i$ to state $j$ on the
step from $X_k$ to $X_{k+1}$. Explain why this process does not satisfy Definition 7.1 of a
(time-homogeneous) Markov chain. Then give a process with a larger state space that
is equivalent to this process and satisfies Definition 7.1.

Exercise 7.4: Prove that the communicating relation defines an equivalence relation.

Exercise 7.5: Prove that if one state in a communicating class is transient (respectively,
recurrent) then all states in that class are transient (respectively, recurrent).

Exercise 7.6: In studying the 2-SAT algorithm, we considered a 1-dimensional
random walk with a completely reflecting boundary at 0. That is, whenever position 0
is reached, with probability 1 the walk moves to position 1 at the next step. Consider
now a random walk with a partially reflecting boundary at 0. Whenever position 0 is
reached, with probability 1/2 the walk moves to position 1 and with probability 1/2
the walk stays at 0. Everywhere else the random walk moves either up or down 1, each
with probability 1/2. Find the expected number of moves to reach $n$, starting from
position $i$ and using a random walk with a partially reflecting boundary.

Exercise 7.7: Suppose that the 2-SAT Algorithm 7.1 starts with an assignment chosen
uniformly at random. How does this affect the expected time until a satisfying
assignment is found?


Exercise 7.8: Generalize the randomized algorithm for 3-SAT to k-SAT. What is the
expected time of the algorithm as a function of k?


Exercise 7.9: In the analysis of the randomized algorithm for 3-SAT, we made the
pessimistic assumption that the current assignment $A_i$ and the truth assignment $S$ differ on
just one variable in the clause chosen at each step. Suppose instead that, independently
at each step, the two assignments disagree on one variable in the clause with
probability $p$ and at least two variables with probability $1 - p$. What is the largest value of $p$
for which you can prove that the expected number of steps before Algorithm 7.2
terminates is polynomial in $n$? Give a proof for this value of $p$ and give an upper bound on
the expected number of steps in this case.

Exercise 7.10: A coloring of a graph is an assignment of a color to each of its vertices.
A graph is k-colorable if there is a coloring of the graph with k colors such that no two
adjacent vertices have the same color. Let G be a 3-colorable graph.

(a) Show that there exists a coloring of the graph with two colors such that no triangle 
is monochromatic. (A triangle of a graph G is a subgraph of G with three vertices, 
which are all adjacent to each other.) 

(b) Consider the following algorithm for coloring the vertices of G with two colors 
so that no triangle is monochromatic. The algorithm begins with an arbitrary 2- 
coloring of G. While there are any monochromatic triangles in G, the algorithm 
chooses one such triangle and changes the color of a randomly chosen vertex of 
that triangle. Derive an upper bound on the expected number of such recoloring 
steps before the algorithm finds a 2-coloring with the desired property. 


Exercise 7.11: An $n \times n$ matrix $P$ with entries $P_{i,j}$ is called stochastic if all entries are
nonnegative and if the sum of the entries in each row is 1. It is called doubly
stochastic if, additionally, the sum of the entries in each column is 1. Show that the uniform
distribution is a stationary distribution for any Markov chain represented by a doubly
stochastic matrix.

Exercise 7.12: Let $X_n$ be the sum of $n$ independent rolls of a fair die. Show that, for
any $k \ge 2$,

    $$\lim_{n \to \infty} \Pr(X_n \text{ is divisible by } k) = \frac{1}{k}.$$

Exercise 7.13: Consider a finite Markov chain on $n$ states with stationary distribution
$\bar{\pi}$ and transition probabilities $P_{i,j}$. Imagine starting the chain at time 0 and running
it for $m$ steps, obtaining the sequence of states $X_0, X_1, \ldots, X_m$. Consider the states in
reverse order, $X_m, X_{m-1}, \ldots, X_0$.

(a) Argue that given $X_{k+1}$, the state $X_k$ is independent of $X_{k+2}, X_{k+3}, \ldots, X_m$. Thus
the reverse sequence is Markovian.

(b) Argue that for the reverse sequence, the transition probabilities $Q_{i,j}$ are given by

    $$Q_{i,j} = \frac{\pi_j P_{j,i}}{\pi_i}.$$

(c) Prove that if the original Markov chain is time reversible, so that $\pi_i P_{i,j} = \pi_j P_{j,i}$,
then $Q_{i,j} = P_{i,j}$. That is, the states follow the same transition probabilities whether
viewed in forward order or reverse order.


Exercise 7.14: Prove that the Markov chain corresponding to a random walk on an 
undirected, non-bipartite graph that consists of one component is time reversible. 


Exercise 7.15: Let $P^t_{i,i}$ be the probability that a Markov chain returns to state $i$ when
started in state $i$ after $t$ steps. Prove that

    $$\sum_{t=1}^{\infty} P^t_{i,i}$$

is unbounded if and only if state $i$ is recurrent.


Exercise 7.16: Prove Lemma 7.5. 


Exercise 7.17: Consider the following Markov chain, which is similar to the
1-dimensional random walk with a completely reflecting boundary at 0. Whenever position 0
is reached, with probability 1 the walk moves to position 1 at the next step. Otherwise,
the walk moves from $i$ to $i + 1$ with probability $p$ and from $i$ to $i - 1$ with probability
$1 - p$. Prove that:

(a) if $p < 1/2$, each state is positive recurrent;

(b) if $p = 1/2$, each state is null recurrent;

(c) if $p > 1/2$, each state is transient.

Exercise 7.18: (a) Consider a random walk on the 2-dimensional integer lattice, where 
each point has four neighbors (up, down, left, and right). Is each state transient, null 
recurrent, or positive recurrent? Give an argument. 

(b) Answer the problem in (a) for the 3-dimensional integer lattice. 


Exercise 7.19: Consider the gambler’s ruin problem, where a player plays until they
lose $\ell_1$ dollars or win $\ell_2$ dollars. Prove that the expected number of games played is
$\ell_1 \ell_2$.

Exercise 7.20: We have considered the gambler’s ruin problem in the case where the
game is fair. Consider the case where the game is not fair; instead, the probability of
losing a dollar each game is 2/3 and the probability of winning a dollar each game is
1/3. Suppose that you start with $i$ dollars and finish either when you reach $n$ or lose it
all. Let $W_t$ be the amount you have gained after $t$ rounds of play.

(a) Show that $E[2^{W_{t+1}}] = E[2^{W_t}]$.

(b) Use part (a) to determine the probability of finishing with 0 dollars and the
probability of finishing with $n$ dollars when starting at position $i$.

(c) Generalize the preceding argument to the case where the probability of losing is
$p > 1/2$. (Hint: Try considering $E[c^{W_t}]$ for some constant $c$.)

Exercise 7.21: Consider a Markov chain on the states $\{0, 1, \ldots, n\}$, where for $i < n$ we
have $P_{i,i+1} = 1/2$ and $P_{i,0} = 1/2$. Also, $P_{n,n} = 1/2$ and $P_{n,0} = 1/2$. This process can
be viewed as a random walk on a directed graph with vertices $\{0, 1, \ldots, n\}$, where each
vertex has two directed edges: one that returns to 0 and one that moves to the vertex
with the next higher number (with a self-loop at vertex $n$). Find the stationary
distribution of this chain. (This example shows that random walks on directed graphs are very
different than random walks on undirected graphs.)


Exercise 7.22: A cat and a mouse each independently take a random walk on a
connected, undirected, non-bipartite graph G. They start at the same time on different
nodes, and each makes one transition at each time step. The cat eats the mouse if they
are ever at the same node at some time step. Let $n$ and $m$ denote, respectively, the
number of vertices and edges of G. Show an upper bound of $O(m^2 n)$ on the expected time
before the cat eats the mouse. (Hint: Consider a Markov chain whose states are the
ordered pairs $(a, b)$, where $a$ is the position of the cat and $b$ is the position of the mouse.)

Exercise 7.23: One way of spreading information on a network uses a rumor-spreading 
paradigm. Suppose that there are n hosts currently on the network. Initially, one host 
begins with a message. Each round, every host that has the message contacts another 
host chosen independently and uniformly at random from the other n — 1 hosts, and 
sends that host the message. We would like to know how many rounds are necessary 
before all hosts have received the message with probability 0.99. 

(a) Explain how this problem can be viewed in terms of Markov chains. 

(b) Determine a method for computing the probability that $j$ hosts have received the
message after round $k$ given that $i$ hosts have received the message after round $k - 1$.
(Hint: There are various ways of doing this. One approach is to let $P(i, j, c)$ be
the probability that $j$ hosts have the message after the first $c$ of the $i$ hosts have
made their choices in a round; then find a recurrence for $P$.)

(c) As a computational exercise, write a program to determine the number of rounds 
required for a message starting at one host to reach all other hosts with probability 
0.9999 when n = 128. 

Exercise 7.24: The lollipop graph on $n$ vertices is a clique on $n/2$ vertices connected
to a path on $n/2$ vertices, as shown in Figure 7.3. The node $u$ is a part of both the clique
and the path. Let $v$ denote the other end of the path.

(a) Show that the expected covering time of a random walk starting at $v$ is $\Theta(n^2)$.

(b) Show that the expected covering time for a random walk starting at $u$ is $\Theta(n^3)$.

Exercise 7.25: The following is a variation of a simple children’s board game. A player
starts at position 0. On a player’s turn, she rolls a standard six-sided die. If her old
position was the positive integer $x$ and her roll is $y$, then her new position is $x + y$, except
in two cases:

• if $x + y$ is divisible by 6 and less than 36, her new position is $x + y - 6$;

• if $x + y$ is greater than 36, the player remains at $x$.

The game ends when a player reaches the goal position, 36.

(a) Let $X_i$ be a random variable representing the number of rolls needed to get to 36
from position $i$ for $0 \le i \le 35$. Give a set of equations that characterize $E[X_i]$.

(b) Using a program that can solve systems of linear equations, find $E[X_i]$ for
$0 \le i \le 35$.

Exercise 7.26: Let n equidistant points be marked on a circle. Without loss of generality,
we think of the points as being labeled clockwise from 0 to $n - 1$. Initially, a wolf
begins at 0 and there is one sheep at each of the remaining $n - 1$ points. The wolf takes
a random walk on the circle. For each step, it moves with probability 1/2 to one neigh¬ 
boring point and with probability 1/2 to the other neighboring point. At the first visit 
to a point, the wolf eats a sheep if there is still one there. Which sheep is most likely 


Exercise 7.27: Suppose that we are given n records, $R_1, R_2, \ldots, R_n$. The records are
kept in some order. The cost of accessing the jth record in the order is j. Thus, if we
had four records ordered as $R_2, R_4, R_3, R_1$, then the cost of accessing $R_4$ would be 2
and the cost of accessing $R_1$ would be 4.

Suppose further that, at each step, record $R_j$ is accessed with probability $p_j$, with
each step being independent of other steps. If we knew the values of the $p_j$ in advance,
we would keep the $R_j$ in decreasing order with respect to $p_j$. But if we don’t know the
$p_j$ in advance, we might use the “move to front” heuristic: at each step, put the record
that was accessed at the front of the list. We assume that moving the record can be
done with no cost and that all other records remain in the same order. For example, if
the order was $R_2, R_4, R_3, R_1$ before $R_3$ was accessed, then the order at the next step
would be $R_3, R_2, R_4, R_1$.

In this setting, the order of the records can be thought of as the state of a Markov
chain. Give the stationary distribution of this chain. Also, let $X_k$ be the cost for
accessing the kth requested record. Determine an expression for $\lim_{k \to \infty} E[X_k]$. Your
expression should be easily computable in time that is polynomial in n, given the $p_j$.

Exercise 7.28: Consider the following variation of the discrete time queue. Time is 
divided into fixed-length steps. At the beginning of each time step, a customer arrives 
with probability $\lambda$. At the end of each time step, if the queue is nonempty then the
customer at the front of the line completes service with probability $\mu$.

(a) Explain how the number of customers in the queue at the beginning of each time step
forms a Markov chain, and determine the corresponding transition probabilities.

(b) Explain under what conditions you would expect a stationary distribution $\pi$ to
exist.

(c) If a stationary distribution exists, then what should be the value of $\pi_0$, the
probability that no customers are in the queue at the beginning of the time step? (Hint:
Consider that, in the long run, the rate at which customers enter the queue and the
rate at which customers leave the queue must be equal.)

(d) Determine the stationary distribution and explain how it corresponds to your
conditions from part (b).

(e) Now consider the variation where we change the order of incoming arrivals and
service. That is: at the beginning of each time step, if the queue is nonempty then
a customer is served with probability $\mu$; and at the end of a time step a customer
arrives with probability $\lambda$. How does this change your answers to parts (a)-(d)?
CHAPTER EIGHT

Continuous Distributions 
and the Poisson Process 


This chapter introduces the general concept of continuous random variables, focusing 
on two examples of continuous distributions: the uniform distribution and the expo¬ 
nential distribution. We then proceed to study the Poisson process, a continuous time 
counting process that is related to both the uniform and exponential distributions. We 
conclude this chapter with basic applications of the Poisson process in queueing theory. 


8.1. Continuous Random Variables 


8.1.1. Probability Distributions in R 


The continuous roulette wheel in Figure 8.1 has circumference 1. We spin the wheel, 
and when it stops, the outcome is the clockwise distance X (computed with infinite 
precision) from the “0” mark to the arrow. 

The sample space $\Omega$ of this experiment consists of all real numbers in the range
[0,1). Assume that any point on the circumference of the disk is equally likely to face 
the arrow when the disk stops. What is the probability of a given outcome x? 

To answer this question, we recall that in Chapter 1 we defined a probability func¬ 
tion to be any function that satisfies the following three requirements: 

1. $\Pr(\Omega) = 1$;

2. for any event E,

$$0 \le \Pr(E) \le 1;$$

3. for any (finite or enumerable) collection B of disjoint events,

$$\Pr\Bigl(\bigcup_{E \in B} E\Bigr) = \sum_{E \in B} \Pr(E).$$

Let S(k) be a set of k distinct points in the range [0,1), and let p be the probability
that any given point in [0,1) is the outcome of the roulette experiment. Since the
probability of any event is bounded by 1,

$$\Pr(x \in S(k)) = kp \le 1.$$

We can choose any number k of distinct points in the range [0,1), so we must have
$kp \le 1$ for any integer k, which implies that p = 0. Thus, we observe that in an infinite
sample space there may be possible events that have probability 0. Taking the complement
of such an event, we observe that in an infinite sample space there can be events
with probability 1 that do not correspond to all possible experimental outcomes, and
thus there can be events with probability 1 that are, in some sense, not certain!

If the probability of each possible outcome of our experiment is 0, how do we define
the probability of larger events with nonzero probability? For probability distributions
over $\mathbb{R}$, probabilities are assigned to intervals rather than to individual values.¹

The probability distribution of a random variable X is given by its distribution function
F(x), where for any $x \in \mathbb{R}$ we define

$$F(x) = \Pr(X \le x).$$

We say that a random variable X is continuous if its distribution function F(x) is
a continuous function of x. We will assume that our random variables are continuous
throughout this chapter. In this case, we must have that $\Pr(X = x) = 0$ for any specific
value x. This further implies that $\Pr(X \le x) = \Pr(X < x)$, a fact we make use
of freely throughout this chapter.

If there is a function f(x) such that, for all $-\infty < a < \infty$,

$$F(a) = \int_{-\infty}^{a} f(t)\,dt,$$

then f(x) is called the density function of F(x), and

$$f(x) = F'(x)$$

where the derivative is well-defined.


¹ A formal treatment of nondenumerably infinite probability spaces relies on measure theory and is beyond the
scope of this book. We just note here that the probability function needs to be measurable on the set of events.
This cannot hold in general for the family of all subsets of the sample space, but it does always hold for the
Borel set of intervals.
Because

$$\Pr(x < X \le x + dx) = F(x + dx) - F(x) \approx f(x)\,dx,$$

we can informally think of f(x) dx as the “probability” of the infinitesimal interval
[x, x + dx). Carrying this analogy forward, in discrete spaces the probability of an
event E is the sum of the probabilities of the simple events included in E. The parallel
concept in the case of events in $\mathbb{R}$ is the integral of the probability density function
over the basic events in E.

For example, the probability of the interval [a, b) is given by the integral 

$$\Pr(a \le X < b) = \int_a^b f(x)\,dx,$$

and the expectation and higher moments of a random variable X with density function
f(x) are defined by the integrals

$$E[X^i] = \int_{-\infty}^{\infty} x^i f(x)\,dx.$$


More generally, for any function g, 


$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx,$$

when this integral exists. The variance of X is given by

$$\mathrm{Var}[X] = E[(X - E[X])^2] = \int_{-\infty}^{\infty} (x - E[X])^2 f(x)\,dx = E[X^2] - (E[X])^2.$$


The following lemma gives the continuous analog to Lemma 2.9. 


Lemma 8.1: Let X be a continuous random variable that takes on only nonnegative
values. Then

$$E[X] = \int_{0}^{\infty} \Pr(X \ge x)\,dx.$$


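Proof: Let f(x) be the density function of X. Then

$$\begin{aligned}
\int_{x=0}^{\infty} \Pr(X \ge x)\,dx &= \int_{x=0}^{\infty} \int_{y=x}^{\infty} f(y)\,dy\,dx \\
&= \int_{y=0}^{\infty} \int_{x=0}^{y} f(y)\,dx\,dy \\
&= \int_{y=0}^{\infty} y f(y)\,dy \\
&= E[X].
\end{aligned}$$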

The interchange of the order of the integrals is justified because the expression being 
integrated is nonnegative. ■ 
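As a quick numerical sanity check of Lemma 8.1, the sketch below compares the two sides for one hypothetical example that does not appear in the text: a random variable with density f(x) = 2x on [0,1], for which Pr(X ≥ x) = 1 - x² on that interval and E[X] = 2/3. The helper name `tail_integral` and the use of the midpoint rule are illustrative choices only.

```python
def tail_integral(survival, upper, steps=100_000):
    """Midpoint-rule estimate of the integral of Pr(X >= x) over [0, upper]."""
    h = upper / steps
    return sum(survival((i + 0.5) * h) for i in range(steps)) * h


def survival(x):
    """Pr(X >= x) for the assumed density f(x) = 2x on [0, 1]."""
    return 1.0 - x * x if x < 1.0 else 0.0


# The density vanishes above 1, so integrating the tail up to 1 is enough.
estimate = tail_integral(survival, upper=1.0)
exact_mean = 2.0 / 3.0  # integral of x * 2x over [0, 1]
print(estimate, exact_mean)
```

The two printed values should agree closely, which is what the lemma predicts for any nonnegative continuous random variable.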
8.1.2. Joint Distributions and Conditional Probability

The notion of a distribution function for a real-valued random variable easily general¬ 
izes to multiple random variables. 

Definition 8.1: The joint distribution function of X and Y is

$$F(x, y) = \Pr(X \le x, Y \le y).$$

The variables X and Y have joint density function f if, for all x, y,

$$F(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(u, v)\,du\,dv.$$

Again, we denote

$$f(x, y) = \frac{\partial^2}{\partial x\,\partial y} F(x, y)$$

when the derivative exists. These definitions are generalized to joint distribution func¬ 
tions over more than two variables in the obvious way. 

Given a joint distribution function F(x, y) over X and Y, one may consider the marginal
distribution functions

$$F_X(x) = \Pr(X \le x), \qquad F_Y(y) = \Pr(Y \le y),$$

and the corresponding marginal density functions $f_X(x)$ and $f_Y(y)$.

Definition 8.2: The random variables X and Y are independent if, for all x and y,

$$\Pr((X \le x) \cap (Y \le y)) = \Pr(X \le x)\,\Pr(Y \le y).$$

From the definition, two random variables are independent if and only if their joint
distribution function is the product of their marginal distribution functions:

$$F(x, y) = F_X(x) F_Y(y).$$

It follows from taking the derivatives with respect to x and y that, if X and Y are
independent, then

$$f(x, y) = f_X(x) f_Y(y),$$

and this condition is sufficient as well.

As an example, let a and b be positive constants, and consider the joint distribution
function for two random variables X and Y given by

$$F(x, y) = 1 - e^{-ax} - e^{-by} + e^{-(ax+by)}$$

over the range $x, y \ge 0$. We can compute that

$$F_X(x) = F(x, \infty) = 1 - e^{-ax},$$

and similarly $F_Y(y) = 1 - e^{-by}$. Alternatively, we could compute




$$f(x, y) = ab\,e^{-(ax+by)},$$

from which it follows that

$$F_X(z) = \int_{x=0}^{z} \int_{y=0}^{\infty} ab\,e^{-(ax+by)}\,dy\,dx = \int_{x=0}^{z} a e^{-ax}\,dx = 1 - e^{-az}.$$

We obtain 


$$F(x, y) = 1 - e^{-ax} - e^{-by} + e^{-(ax+by)} = (1 - e^{-ax})(1 - e^{-by}) = F_X(x) F_Y(y),$$

so X and Y are independent. Alternatively, working with the density functions we verify
their independence by

$$f_X(x) = a e^{-ax}, \qquad f_Y(y) = b e^{-by}, \qquad f(x, y) = f_X(x) f_Y(y).$$
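The factorization in this example can also be checked numerically. The following is a minimal sketch, assuming arbitrary illustrative constants a = 0.5 and b = 2.0 and a small evaluation grid; the function names are hypothetical and not part of the text.

```python
import math


def joint_cdf(x, y, a, b):
    """F(x, y) = 1 - e^{-ax} - e^{-by} + e^{-(ax+by)} for x, y >= 0."""
    return 1 - math.exp(-a * x) - math.exp(-b * y) + math.exp(-(a * x + b * y))


def marginal_cdf(x, rate):
    """Marginal distribution function 1 - e^{-rate * x}."""
    return 1 - math.exp(-rate * x)


a, b = 0.5, 2.0  # arbitrary positive constants chosen for illustration
grid = [0.1 * i for i in range(51)]

# Largest gap between F(x, y) and F_X(x) * F_Y(y) over the grid.
max_gap = max(
    abs(joint_cdf(x, y, a, b) - marginal_cdf(x, a) * marginal_cdf(y, b))
    for x in grid
    for y in grid
)
print(max_gap)
```

A gap at the level of floating-point rounding is consistent with the joint distribution function being the product of its marginals.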

Conditional probability for continuous random variables introduces a nontrivial subtlety.
The natural definition,

$$\Pr(E \mid F) = \frac{\Pr(E \cap F)}{\Pr(F)},$$

is suitable when $\Pr(F) \ne 0$. For example,

$$\Pr(X \le 3 \mid Y \le 6) = \frac{\Pr((X \le 3) \cap (Y \le 6))}{\Pr(Y \le 6)}$$

when $\Pr(Y \le 6)$ is not zero.

In the discrete case, if Pr(F) = 0 then $\Pr(E \mid F)$ was simply not well-defined. In
the continuous case, there are well-defined expressions that condition on events that
occur with probability 0. For example, for the joint distribution function
$F(x, y) = 1 - e^{-ax} - e^{-by} + e^{-(ax+by)}$ examined previously, it seems reasonable to consider

$$\Pr(X \le 3 \mid Y = 4),$$

but since Y = 4 is an event with probability 0, the definition is not applicable.
If we did apply the definition, it would yield

$$\Pr(X \le 3 \mid Y = 4) = \frac{\Pr((X \le 3) \cap (Y = 4))}{\Pr(Y = 4)}.$$

Both the numerator and denominator are zero, suggesting that we should be taking a
limit as they both approach zero. The natural choice is

$$\Pr(X \le 3 \mid Y = 4) = \lim_{\delta \to 0} \Pr(X \le 3 \mid 4 \le Y \le 4 + \delta).$$

This choice leads us to the following definition: 

$$\Pr(X \le x \mid Y = y) = \int_{u=-\infty}^{x} \frac{f(u, y)}{f_Y(y)}\,du.$$

To see informally why this is a reasonable choice, consider 
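$$\begin{aligned}
\lim_{\delta \to 0} \Pr(X \le x \mid y \le Y \le y + \delta)
&= \lim_{\delta \to 0} \frac{\Pr((X \le x) \cap (y \le Y \le y + \delta))}{\Pr(y \le Y \le y + \delta)} \\
&= \lim_{\delta \to 0} \frac{F(x, y + \delta) - F(x, y)}{F_Y(y + \delta) - F_Y(y)} \\
&= \lim_{\delta \to 0} \int_{u=-\infty}^{x} \frac{\partial F(u, y + \delta)/\partial x - \partial F(u, y)/\partial x}{F_Y(y + \delta) - F_Y(y)}\,du \\
&= \int_{u=-\infty}^{x} \lim_{\delta \to 0} \frac{\bigl(\partial F(u, y + \delta)/\partial x - \partial F(u, y)/\partial x\bigr)/\delta}{\bigl(F_Y(y + \delta) - F_Y(y)\bigr)/\delta}\,du \\
&= \int_{u=-\infty}^{x} \frac{f(u, y)}{f_Y(y)}\,du.
\end{aligned}$$

Here we have assumed that we can interchange the limit with the integration and that
$f_Y(y) \ne 0$.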

The value

$$f_{X \mid Y}(x, y) = \frac{f(x, y)}{f_Y(y)}$$

is also called a conditional density function. We may similarly use

$$f_{Y \mid X}(x, y) = \frac{f(x, y)}{f_X(x)}.$$

Our definition yields the natural interpretation that, in order to compute
$\Pr(X \le x \mid Y = y)$, we integrate the corresponding conditional density function over
the appropriate range. You can check that this definition yields the standard definition
for $\Pr(X \le x \mid Y \le y)$ through appropriate integration. Similarly, we may compute the
conditional expectation

$$E[X \mid Y = y] = \int_{x=-\infty}^{\infty} x\, f_{X \mid Y}(x, y)\,dx$$

using the conditional density function.

For our example, when $F(x, y) = 1 - e^{-ax} - e^{-by} + e^{-(ax+by)}$, it follows that

$$\Pr(X \le 3 \mid Y = 4) = \int_{u=0}^{3} \frac{ab\,e^{-(au+4b)}}{b\,e^{-4b}}\,du = 1 - e^{-3a},$$

a result we could also have achieved directly using independence.
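To make the limiting definition concrete, the sketch below evaluates Pr(X ≤ 3 | 4 ≤ Y ≤ 4 + δ) for a few shrinking values of δ, using the example distribution function F(x, y) = 1 - e^{-ax} - e^{-by} + e^{-(ax+by)}. The constants a = 0.5 and b = 0.25 and the function names are assumptions made only for this illustration.

```python
import math


def joint_cdf(x, y, a, b):
    """Example joint distribution function for x, y >= 0."""
    return 1 - math.exp(-a * x) - math.exp(-b * y) + math.exp(-(a * x + b * y))


def cdf_y(y, b):
    """Marginal distribution function of Y."""
    return 1 - math.exp(-b * y)


def conditional_on_strip(x, y, delta, a, b):
    """Pr(X <= x | y <= Y <= y + delta) as a ratio of differences of F."""
    numerator = joint_cdf(x, y + delta, a, b) - joint_cdf(x, y, a, b)
    denominator = cdf_y(y + delta, b) - cdf_y(y, b)
    return numerator / denominator


a, b = 0.5, 0.25  # illustrative constants, not from the text
limit_value = 1 - math.exp(-3 * a)  # the closed form 1 - e^{-3a}
strip_values = [conditional_on_strip(3, 4, d, a, b) for d in (1.0, 0.1, 0.001)]
print(strip_values, limit_value)
```

Because X and Y are independent in this example, every strip width gives essentially the same value, so the limit is reached immediately; for dependent variables the values would change as δ shrinks.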


8.2. The Uniform Distribution 

When a random variable X assumes values in the interval [a, b] such that all subintervals
of equal length have equal probability, we say that X has the uniform distribution
over the interval [a, b] or alternatively that it is uniform over the interval [a, b]. We
denote such a random variable by U[a, b]. We may also talk about uniform distributions
over the interval [a, b), (a, b], or (a, b). Indeed, since the probability of taking on any
specific value is 0 when b > a, the distributions are essentially the same.

Figure 8.2: The uniform distribution.

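The probability distribution function of such an X is

$$F(x) = \begin{cases} 0 & \text{if } x \le a, \\ \dfrac{x - a}{b - a} & \text{if } a \le x \le b, \\ 1 & \text{if } x \ge b, \end{cases}$$

and its density function is

$$f(x) = \begin{cases} 0 & \text{if } x < a, \\ \dfrac{1}{b - a} & \text{if } a \le x \le b, \\ 0 & \text{if } x > b. \end{cases}$$

These are shown in Figure 8.2.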
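The expectation of X is

$$E[X] = \int_a^b \frac{x}{b - a}\,dx = \frac{b^2 - a^2}{2(b - a)} = \frac{b + a}{2},$$

and the second moment is

$$E[X^2] = \int_a^b \frac{x^2}{b - a}\,dx = \frac{b^3 - a^3}{3(b - a)} = \frac{b^2 + ab + a^2}{3}.$$

The variance is computed by

$$\mathrm{Var}[X] = E[X^2] - (E[X])^2 = \frac{b^2 + ab + a^2}{3} - \frac{(b + a)^2}{4} = \frac{(b - a)^2}{12}.$$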

In our continuous roulette example, the outcome X of the experiment has a uniform 
distribution over [0,1). Thus, the expectation of X is 1/2 and the variance of X is 1/12. 
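The roulette values 1/2 and 1/12 can be checked empirically. The following is a minimal Monte Carlo sketch; it assumes that Python's `random.Random.random()`, which returns floats in [0, 1), is an adequate stand-in for the idealized wheel, and the function name and seed are arbitrary.

```python
import random


def sample_moments(num_spins, seed=0):
    """Return the sample mean and sample variance of simulated spins."""
    rng = random.Random(seed)
    spins = [rng.random() for _ in range(num_spins)]
    mean = sum(spins) / num_spins
    variance = sum((s - mean) ** 2 for s in spins) / num_spins
    return mean, variance


mean, variance = sample_moments(200_000)
print(mean, 0.5)
print(variance, 1 / 12)
```

With this many spins the estimates should sit very close to the exact values, although any single run only approximates them.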


8.2.1. Additional Properties of the Uniform Distribution 

Suppose you have a random variable X chosen from a uniform distribution, say over 
[0,1], and it is revealed that X is less than or equal to 1/2. With this information, the 
conditional distribution of X remains uniform over the smaller interval [0, 1/2].

Lemma 8.2: Let X be a uniform random variable on [a, b]. Then, for $c \le d$,

$$\Pr(X \le c \mid X \le d) = \frac{c - a}{d - a}.$$

That is, conditioned on the fact that $X \le d$, X is uniform on [a, d].


Proof:

$$\begin{aligned}
\Pr(X \le c \mid X \le d) &= \frac{\Pr((X \le c) \cap (X \le d))}{\Pr(X \le d)} \\
&= \frac{\Pr(X \le c)}{\Pr(X \le d)} \\
&= \frac{c - a}{d - a}.
\end{aligned}$$

It follows that X, conditioned on being less than or equal to d, has a distribution function
that is exactly that of a uniform random variable on [a, d]. ■


Of course, a similar statement holds if we consider $\Pr(X \le c \mid X \ge d)$: conditioned
on $X \ge d$, the resulting distribution is uniform over [d, b].

Another fact about the uniform distribution stems from the intuition that, if n points 
are uniformly distributed over an interval, we expect them to be roughly equally spaced. 
We can codify this idea as follows. 

Lemma 8.3: Let $X_1, X_2, \ldots, X_n$ be independent uniform random variables over [0, 1].
Let $Y_1, Y_2, \ldots, Y_n$ be the same values as $X_1, X_2, \ldots, X_n$ in increasing sorted order. Then

$$E[Y_k] = \frac{k}{n + 1}.$$


Proof: Let us first prove the result for $Y_1$ with an explicit calculation. By definition,
$Y_1 = \min(X_1, X_2, \ldots, X_n)$. Now

$$\begin{aligned}
\Pr(Y_1 \ge y) &= \Pr(\min(X_1, X_2, \ldots, X_n) \ge y) \\
&= \Pr((X_1 \ge y) \cap (X_2 \ge y) \cap \cdots \cap (X_n \ge y)) \\
&= \prod_{i=1}^{n} \Pr(X_i \ge y) \\
&= (1 - y)^n.
\end{aligned}$$

It follows from Lemma 8.1 that

$$E[Y_1] = \int_{y=0}^{1} (1 - y)^n\,dy = \frac{1}{n + 1}.$$

Alternatively, one could use $F(y) = 1 - (1 - y)^n$ so that the density function of $Y_1$ is
$f(y) = n(1 - y)^{n-1}$, and hence using integration by parts yields

$$E[Y_1] = \int_{y=0}^{1} n y (1 - y)^{n-1}\,dy = -y(1 - y)^n \Big|_{y=0}^{y=1} + \int_{y=0}^{1} (1 - y)^n\,dy = \frac{1}{n + 1}.$$
Figure 8.3: A correspondence between random points on a circle and random points on a line.


This analysis can be extended to find $E[Y_k]$ with some computation, which we leave
as Exercise 8.5. A simpler approach, however, makes use of symmetry. Consider the
circle of circumference 1, and place n + 1 points $P_0, P_1, \ldots, P_n$ independently and
uniformly at random on the circle. This is equivalent to choosing each point by a spin of
the continuous roulette wheel of Section 8.1.1. Label the point $P_0$ as 0, and let $X_i$ be
the distance traveling clockwise from $P_0$ to $P_i$. The $X_i$ are then independent, uniform
random variables from [0, 1]. The value $Y_k$ is just the distance to the kth point reached
traveling clockwise from $P_0$. See Figure 8.3.

The distance between $Y_k$ and $Y_{k+1}$ is the length of the arc between the two corresponding
adjacent points. By symmetry, however, all of the arcs between adjacent
points must have the same expected length. The expected length of each arc is therefore
1/(n + 1), since there are n + 1 arcs created by the n + 1 points and since their total
length is 1. By the linearity of expectations, $E[Y_k]$ is the sum of the expected lengths
of the first k arcs, and hence $E[Y_k] = k/(n + 1)$. ■
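The conclusion of Lemma 8.3 is easy to probe by simulation. The sketch below is only an illustration under assumed parameters (n = 4 points, 50,000 trials, a fixed seed); the function name is hypothetical.

```python
import random


def mean_order_statistics(n, trials, seed=0):
    """Estimate E[Y_1], ..., E[Y_n] for n independent uniform [0, 1) samples."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(trials):
        ys = sorted(rng.random() for _ in range(n))
        for k, y in enumerate(ys):
            totals[k] += y
    return [total / trials for total in totals]


n = 4
estimates = mean_order_statistics(n, trials=50_000)
for k, estimate in enumerate(estimates, start=1):
    # Compare each estimate with the predicted value k / (n + 1).
    print(k, estimate, k / (n + 1))
```

Each estimated mean should land near k/(n + 1), matching the intuition that the sorted points are roughly equally spaced.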


This proof makes use of an interesting one-to-one correspondence between choosing 
n points independently and uniformly at random from [0,1] and choosing n + 1 points 
independently and uniformly at random from the boundary of the circle with circumfer¬ 
ence 1. Such relationships, when they are available, can often greatly simplify an other¬ 
wise lengthy analysis. We develop other similar relationships throughout this chapter. 


8.3. The Exponential Distribution 


Another important continuous distribution is the exponential distribution. 


Definition 8.3: An exponential distribution with parameter $\theta$ is given by the following
probability distribution function:

$$F(x) = \begin{cases} 1 - e^{-\theta x} & \text{for } x \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$
Figure 8.4: The exponential distribution. (a) $f(x) = \theta e^{-\theta x}$, $x \ge 0$. (b) $F(x) = 1 - e^{-\theta x}$, $x \ge 0$.

The density function of the exponential distribution is

$$f(x) = \theta e^{-\theta x} \quad \text{for } x \ge 0.$$

See Figure 8.4.

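Its first and second moments are

$$E[X] = \int_0^{\infty} t\,\theta e^{-\theta t}\,dt = \frac{1}{\theta},$$

$$E[X^2] = \int_0^{\infty} t^2\,\theta e^{-\theta t}\,dt = \frac{2}{\theta^2}.$$

Hence,

$$\mathrm{Var}[X] = E[X^2] - (E[X])^2 = \frac{1}{\theta^2}.$$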


8.3.1. Additional Properties of the Exponential Distribution 

Perhaps the most important property of the exponential distribution is that, like the dis¬ 
crete geometric distribution, it is memoryless. 


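Lemma 8.4: For an exponential random variable with parameter $\theta$,

$$\Pr(X > s + t \mid X > t) = \Pr(X > s).$$

Proof:

$$\begin{aligned}
\Pr(X > s + t \mid X > t) &= \frac{\Pr(X > s + t)}{\Pr(X > t)} \\
&= \frac{1 - \Pr(X \le s + t)}{1 - \Pr(X \le t)} \\
&= \frac{e^{-\theta(s + t)}}{e^{-\theta t}} \\
&= e^{-\theta s} = \Pr(X > s).
\end{aligned}$$

■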
The exponential distribution is the only continuous memoryless distribution. It can
be viewed as the continuous version of the discrete geometric distribution, which is 
the only discrete memoryless distribution. The geometric distribution models the time 
until first success in a sequence of independent identical Bernoulli trials, whereas the 
exponential distribution models the time until the first event in a memoryless continu¬ 
ous time stochastic process. 
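The memoryless property can be observed directly in simulated data. The sketch below uses `random.expovariate`, whose argument is the rate (1 divided by the mean), so it plays the role of the parameter $\theta$; the values of theta, s, and t, the seed, and the function name are arbitrary choices for illustration.

```python
import math
import random


def conditional_tail(theta, s, t, trials, seed=0):
    """Estimate Pr(X > s + t | X > t) and Pr(X > s) from exponential samples."""
    rng = random.Random(seed)
    samples = [rng.expovariate(theta) for _ in range(trials)]
    survivors = [x for x in samples if x > t]
    conditional = sum(x > s + t for x in survivors) / len(survivors)
    unconditional = sum(x > s for x in samples) / trials
    return conditional, unconditional


theta, s, t = 1.5, 0.4, 0.7
conditional, unconditional = conditional_tail(theta, s, t, trials=200_000)
print(conditional, unconditional, math.exp(-theta * s))
```

All three printed numbers should be close: having already waited t units does not change the distribution of the remaining wait.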

The minimum of several exponential random variables also exhibits some interest¬ 
ing properties. 


Lemma 8.5: If $X_1, X_2, \ldots, X_n$ are independent exponentially distributed random variables
with parameters $\theta_1, \theta_2, \ldots, \theta_n$, respectively, then $\min(X_1, X_2, \ldots, X_n)$ is exponentially
distributed with parameter $\sum_{i=1}^{n} \theta_i$ and

$$\Pr(\min(X_1, X_2, \ldots, X_n) = X_i) = \frac{\theta_i}{\sum_{j=1}^{n} \theta_j}.$$

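Proof: It suffices to prove the statement for two exponential random variables; the general
case then follows by induction. Let $X_1$ and $X_2$ be independent exponential random
variables with parameters $\theta_1$ and $\theta_2$. Then

$$\begin{aligned}
\Pr(\min(X_1, X_2) > x) &= \Pr((X_1 > x) \cap (X_2 > x)) \\
&= \Pr(X_1 > x)\,\Pr(X_2 > x) \\
&= e^{-\theta_1 x} e^{-\theta_2 x} \\
&= e^{-(\theta_1 + \theta_2)x}.
\end{aligned}$$

Hence the minimum has an exponential distribution with parameter $\theta_1 + \theta_2$.

Moreover, let $f(x_1, x_2)$ be the joint density function of $(X_1, X_2)$. Since the variables are
independent, we have $f(x_1, x_2) = \theta_1 e^{-\theta_1 x_1}\,\theta_2 e^{-\theta_2 x_2}$. Hence

$$\begin{aligned}
\Pr(X_1 < X_2) &= \int_{x_2=0}^{\infty} \int_{x_1=0}^{x_2} f(x_1, x_2)\,dx_1\,dx_2 \\
&= \int_{x_2=0}^{\infty} \theta_2 e^{-\theta_2 x_2} \Bigl( \int_{x_1=0}^{x_2} \theta_1 e^{-\theta_1 x_1}\,dx_1 \Bigr) dx_2 \\
&= \int_{x_2=0}^{\infty} \theta_2 e^{-\theta_2 x_2} \bigl(1 - e^{-\theta_1 x_2}\bigr)\,dx_2 \\
&= \int_{x_2=0}^{\infty} \bigl(\theta_2 e^{-\theta_2 x_2} - \theta_2 e^{-(\theta_1 + \theta_2) x_2}\bigr)\,dx_2 \\
&= 1 - \frac{\theta_2}{\theta_1 + \theta_2} \\
&= \frac{\theta_1}{\theta_1 + \theta_2}.
\end{aligned}$$

■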
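Both claims of Lemma 8.5 can be checked by simulation. The following sketch assumes three illustrative rates (1, 2, and 3), so the lemma predicts a minimum with mean 1/6 and "winning" probabilities 1/6, 1/3, and 1/2; the function name and seed are hypothetical choices.

```python
import random


def simulate_minimum(thetas, trials, seed=0):
    """Estimate E[min X_i] and the fraction of trials in which each X_i is smallest."""
    rng = random.Random(seed)
    wins = [0] * len(thetas)
    total_min = 0.0
    for _ in range(trials):
        draws = [rng.expovariate(theta) for theta in thetas]
        smallest = min(draws)
        total_min += smallest
        wins[draws.index(smallest)] += 1
    return total_min / trials, [w / trials for w in wins]


thetas = [1.0, 2.0, 3.0]
mean_min, win_fractions = simulate_minimum(thetas, trials=100_000)
rate_sum = sum(thetas)
print(mean_min, 1 / rate_sum)
print(win_fractions, [theta / rate_sum for theta in thetas])
```

The estimated mean of the minimum should be near the reciprocal of the summed rates, and each variable should be the smallest in roughly a fraction of trials proportional to its rate.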


For example, suppose that an airline ticket counter has n service agents, where the time
that agent i takes per customer has an exponential distribution with parameter $\theta_i$. You
stand at the head of the line at time $T_0$, and all of the n agents are busy. What is the
average time you wait for an agent?

Because service times are exponentially distributed, it does not matter for how long
each agent has been helping another customer before time $T_0$; the remaining time for
each customer is still exponentially distributed. This is a feature of the memoryless
property of the exponential distribution. Lemma 8.5 therefore applies. The time until
the first agent becomes free is exponentially distributed with parameter $\sum_{i=1}^{n} \theta_i$, so
the expected waiting time is $1/\sum_{i=1}^{n} \theta_i$. Indeed, you can even determine the probability
that each agent is the first to become free: the jth agent is first with probability
$\theta_j / \sum_{i=1}^{n} \theta_i$.

8.3.2.* Example: Balls and Bins with Feedback 

As an application of the exponential distribution, we consider an interesting variation 
of our standard balls-and-bins model. In this problem we have only two bins, and balls 
arrive one by one. Initially both bins have at least one ball. Suppose that, if bin 1 has
x balls and bin 2 has y balls, then the probability that bin 1 obtains the next ball is
x/(x + y) while the probability that bin 2 obtains the next ball is y/(x + y). This system
has feedback: the more balls a bin has, the more balls it is likely to obtain in the
future. An equivalent problem is given in Exercise 1.6. You may wish to check (by induction)
that, if both bins start with one ball and there are n total balls, then the number
of balls in bin 1 is uniformly distributed in the range [1, n - 1].

Suppose instead that we strengthen the feedback in the following way. If bin 1 has
x balls and bin 2 has y balls, then the probability that bin 1 obtains the next ball is
$x^p/(x^p + y^p)$ and the probability that bin 2 obtains the next ball is $y^p/(x^p + y^p)$ for
some p > 1. For example, when p = 2, if bin 1 has three balls and bin 2 has four balls,
then the probability that the next ball goes into bin 1 is only 9/25 < 3/7. Setting p > 1
strengthens the advantage of the bin with more balls.

This model has been suggested to describe economic situations that result in mo¬ 
nopoly. For example, suppose there are two operating systems, Lindows and Winux. 
Users will tend to purchase machines with the same operating system that other users 
have in order to maintain compatibility. This effect might be nonlinear in the number 
of users of each system; this is modeled by the parameter p. 

We now show a remarkable result: as long as p > 1, there is some point at which
one bin obtains all the rest of the balls thrown. In the economic setting, this is a very
strong form of monopoly; the other competitor simply stops obtaining new customers.

Theorem 8.6: Under any starting conditions, if p > 1 then with probability 1 there
exists a number c such that one of the two bins gets no more than c balls.


Note the careful wording of the theorem. We are not saying that there is some fixed c 
(perhaps dependent on the initial conditions) such that one bin gets no more than c balls. 
(If we meant this, we would say that there exists a number c such that, with probabil¬ 
ity 1, one bin gets no more than c balls.) Instead, we are saying that, with probability 1, 
at some point (which we do not know ahead of time) one bin stops receiving balls. 
Figure 8.5: In the setup where the time between ball arrivals is exponentially distributed, each bin
can be considered separately; an outcome of the original process is obtained by simply combining
the timelines of the two bins. (Timelines shown: Bin 1, Bin 2, and Bins 1 and 2 combined.)


Proof: For convenience, assume that both bins start with one ball; this does not affect 
the result. 

We start by considering a very closely related process. Consider two bins that start
with one ball at time 0. Balls arrive at each of the bins. If bin 1 obtains its zth ball at
time t then it obtains its next ball at a time $t + T_z$, where $T_z$ is a random variable
exponentially distributed with parameter $z^p$. Similarly, if bin 2 obtains its zth ball at time t
then it obtains its next ball at a time $t + U_z$, where $U_z$ is also a random variable
exponentially distributed with parameter $z^p$. All values of $T_z$ and $U_z$ are independent. Each
bin can be considered independently in this setup; what happens at one bin does not
affect the other.

Although this process may not seem related to the original problem, we now claim
that it mimics it exactly. Consider the point at which a ball arrives, leaving x balls in
bin 1 and y balls in bin 2. By the memoryless nature of the exponential distribution,
it does not matter which bin the most recently arrived ball has landed in; the time for
the next ball to land in bin 1 is exponentially distributed with mean $x^{-p}$ and the time
for the next ball to land in bin 2 is exponentially distributed with mean $y^{-p}$. Moreover,
by Lemma 8.5, the next ball lands in bin 1 with probability $x^p/(x^p + y^p)$ and in bin 2
with probability $y^p/(x^p + y^p)$. Therefore, this setup mimics exactly what happens in
the original problem. See Figure 8.5.

Let us define the saturation time $F_1$ for bin 1 by $F_1 = \sum_{j=1}^{\infty} T_j$, and similarly
$F_2 = \sum_{j=1}^{\infty} U_j$. The saturation time represents the first time in which the total number of balls
received by a bin is unbounded. It is not clear that saturation times are well-defined
random variables: What if the sum does not converge, and thus its value is infinity? It
is here that we make use of the fact that p > 1. We have

$$E[F_1] = E\Bigl[\sum_{j=1}^{\infty} T_j\Bigr] = \sum_{j=1}^{\infty} E[T_j] = \sum_{j=1}^{\infty} \frac{1}{j^p}.$$

Here we used linearity of expectations for a countably infinite summation of random
variables, which holds if $\sum_{j=1}^{\infty} E[|T_j|]$ converges. (Chapter 2 discusses the applicability
of the linearity of expectations to countably infinite summations; see in particular
Exercise 2.29.) It suffices to show that $\sum_{j=1}^{\infty} 1/j^p$ converges to a finite number whenever
p > 1. This follows from bounding the summation by the appropriate integral:

$$\sum_{j=1}^{\infty} \frac{1}{j^p} \le 1 + \int_{u=1}^{\infty} \frac{du}{u^p} = 1 + \frac{1}{p - 1}.$$

Indeed, all of the integral moments converge to a finite number. It follows that both $F_1$
and $F_2$ are, with probability 1, finite and hence well-defined.

Furthermore, $F_1$ and $F_2$ are distinct with probability 1. To see this, suppose that the
values for all of the random variables $T_z$ and $U_z$ are given except for $T_1$. Then, for $F_1$
to equal $F_2$, it must be the case that

$$T_1 = \sum_{j=1}^{\infty} U_j - \sum_{j=2}^{\infty} T_j.$$

But the probability that $T_1$ takes on any specific value is 0, just as the probability that
our roulette wheel takes on any specific value is 0. Hence, $F_1 \ne F_2$ with probability 1.
Suppose that $F_1 < F_2$. Then we must have for some n that

$$\sum_{j=1}^{n} U_j < F_1 < \sum_{j=1}^{n+1} U_j.$$

This implies that, for any sufficiently large number m,

$$\sum_{j=1}^{n} U_j < \sum_{j=1}^{m} T_j < \sum_{j=1}^{n+1} U_j,$$

which means that bin 1 has obtained m balls before bin 2 has obtained its (n + 1)th
ball. Since our new process corresponds exactly to the original balls-and-bins process,
this is also what happens in the original process. But this means that, once bin 2 has n
balls, it does not receive any others; they all go to bin 1. The argument is the same if
$F_2 < F_1$. Hence, with probability 1, there exists some n such that one bin obtains no
more than n balls. ■

When p is close to 1 or when the bins start with a large and nearly equal number of
balls, it can take a long time before one bin dominates enough to obtain such a monopoly.
On the other hand, monopoly happens quickly when p is much greater than 1
(such as p = 2) and the bins start with just one ball each. You are asked to simulate
this process in Exercise 8.24.
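A direct simulation of the original feedback process gives a feel for how quickly one bin takes over. The sketch below is one possible setup and is not presented as the intended solution to Exercise 8.24: both bins start with one ball, each new ball picks a bin with probability proportional to the pth power of its current load, and the function name, seed, and parameters are assumptions for illustration.

```python
import random


def simulate_feedback(p, num_balls, seed=0):
    """Throw num_balls balls into two bins that both start with one ball.

    Returns the final loads and, for each bin, the index of the last throw
    that landed in it (0 if it never received a new ball).
    """
    rng = random.Random(seed)
    bins = [1, 1]
    last_arrival = [0, 0]
    for step in range(1, num_balls + 1):
        weight0 = bins[0] ** p
        weight1 = bins[1] ** p
        winner = 0 if rng.random() < weight0 / (weight0 + weight1) else 1
        bins[winner] += 1
        last_arrival[winner] = step
    return bins, last_arrival


num_balls = 10_000
bins, last_arrival = simulate_feedback(p=2.0, num_balls=num_balls)
print("final loads:", bins)
print("last throw received by each bin:", last_arrival)
```

The bin with the smaller final load is the one that lost the race; the throw index at which it last received a ball indicates when the monopoly set in for that run. Rerunning with p closer to 1, or with other seeds, shows how variable that moment can be.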


8.4. The Poisson Process 


The Poisson process is an important counting process that is related to both the uniform
and the exponential distribution. Consider a sequence of random events, such as
arrivals of customers to a queue or emissions of alpha particles from a radioactive material.
Let N(t) denote the number of events in the interval [0, t]. The process $\{N(t), t \ge 0\}$
is a stochastic counting process.
Definition 8.4: A Poisson process with parameter (or rate) $\lambda$ is a stochastic counting
process $\{N(t), t \ge 0\}$ such that the following statements hold.

1. N(0) = 0.

2. The process has independent and stationary increments. That is, for any $t, s \ge 0$,
the distribution of N(t + s) - N(s) is identical to the distribution of N(t), and for
any two disjoint intervals $[t_1, t_2]$ and $[t_3, t_4]$, the distribution of $N(t_2) - N(t_1)$ is
independent of the distribution of $N(t_4) - N(t_3)$.

3. $\lim_{t \to 0} \Pr(N(t) = 1)/t = \lambda$. That is, the probability of a single event in a short
interval t tends to $\lambda t$.

4. $\lim_{t \to 0} \Pr(N(t) \ge 2)/t = 0$. That is, the probability of more than one event in a
short interval t tends to zero.

The surprising fact is that this set of broad, relatively natural conditions defines a unique 
process. In particular, the number of events in a given time interval follows the Poisson 
distribution defined in Section 5.3. 


Theorem 8.7: Let $\{N(t) \mid t \ge 0\}$ be a Poisson process with parameter $\lambda$. For any
$t, s \ge 0$ and any integer $n \ge 0$,

$$P_n(t) = \Pr(N(t + s) - N(s) = n) = e^{-\lambda t} \frac{(\lambda t)^n}{n!}.$$


Proof: We first observe that $P_n(t)$ is well-defined since, by the second property of
Definition 8.4, the distribution of N(t + s) - N(s) depends only on t and is independent
of s.

To compute $P_0(t)$, we note that the number of events in the intervals [0, t] and
(t, t + h] are independent random variables and therefore

$$P_0(t + h) = P_0(t) P_0(h).$$

We now write

$$\begin{aligned}
\frac{P_0(t + h) - P_0(t)}{h} &= P_0(t)\,\frac{P_0(h) - 1}{h} \\
&= P_0(t)\,\frac{1 - \Pr(N(h) = 1) - \Pr(N(h) \ge 2) - 1}{h} \\
&= P_0(t)\,\frac{-\Pr(N(h) = 1) - \Pr(N(h) \ge 2)}{h}.
\end{aligned}$$

Taking the limit as $h \to 0$ and applying properties 2-4 of Definition 8.4, we obtain

$$P_0'(t) = \lim_{h \to 0} \frac{P_0(t + h) - P_0(t)}{h} = \lim_{h \to 0} P_0(t)\,\frac{-\Pr(N(h) = 1) - \Pr(N(h) \ge 2)}{h} = -\lambda P_0(t).$$

To solve

$$P_0'(t) = -\lambda P_0(t),$$

we rewrite it as

$$\frac{P_0'(t)}{P_0(t)} = -\lambda.$$

Integrating with respect to t gives

$$\ln P_0(t) = -\lambda t + C,$$

or

$$P_0(t) = e^{-\lambda t + C}.$$

Since $P_0(0) = 1$, we conclude that

$$P_0(t) = e^{-\lambda t}. \tag{8.1}$$

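For $n \ge 1$, we write

$$\begin{aligned}
P_n(t + h) &= \sum_{k=0}^{n} P_{n-k}(t) P_k(h) \\
&= P_n(t) P_0(h) + P_{n-1}(t) P_1(h) + \sum_{k=2}^{n} P_{n-k}(t) \Pr(N(h) = k).
\end{aligned}$$

Computing the first derivative of $P_n(t)$ yields

$$\begin{aligned}
P_n'(t) &= \lim_{h \to 0} \frac{P_n(t + h) - P_n(t)}{h} \\
&= \lim_{h \to 0} \frac{1}{h} \Bigl( P_n(t)(P_0(h) - 1) + P_{n-1}(t) P_1(h) + \sum_{k=2}^{n} P_{n-k}(t) \Pr(N(h) = k) \Bigr) \\
&= -\lambda P_n(t) + \lambda P_{n-1}(t),
\end{aligned}$$

where we use the facts that

$$\lim_{h \to 0} \frac{P_1(h)}{h} = \lambda$$

(by properties 2 and 3) and

$$0 \le \lim_{h \to 0} \frac{1}{h} \sum_{k=2}^{n} P_{n-k}(t) \Pr(N(h) = k) \le \lim_{h \to 0} \frac{\Pr(N(h) \ge 2)}{h} = 0$$

(by property 4), so

$$\lim_{h \to 0} \frac{1}{h} \sum_{k=2}^{n} P_{n-k}(t) \Pr(N(h) = k) = 0.$$

To solve

$$P_n'(t) = -\lambda P_n(t) + \lambda P_{n-1}(t)$$

we write

$$e^{\lambda t}\bigl(P_n'(t) + \lambda P_n(t)\bigr) = e^{\lambda t} \lambda P_{n-1}(t),$$

which gives

$$\frac{d}{dt}\bigl(e^{\lambda t} P_n(t)\bigr) = \lambda e^{\lambda t} P_{n-1}(t). \tag{8.2}$$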
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Using (8.1) then yields 


implying 


-(^P ] (t)) = X^P {) (t) = X. 


P\{t) = (Ar+c)e- x/ . 

Since P\(0) = 0, we conclude that 

Pi(r )= 入 re—;". 

We continue by induction on n to prove that, for all n > 0, 

aty 


PM) = e 


nl 


Using Eqn. (8.2) and the induction hypothesis, we have 

d ,, u X n t ,l ~ l 

— (e A, P„(t)) = = - - — 

dt (n - 1)! 

Integrating and using the fact that 尸 "(0) = 0 gives the result. 


(8.3) 


The parameter $\lambda$ is also called the rate of the Poisson process, since (as we have proved)
the number of events during any time period of length $t$ is a Poisson random variable
with expectation $\lambda t$.
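As a numerical sanity check on Theorem 8.7, the following minimal sketch evaluates $P_n(t) = e^{-\lambda t}(\lambda t)^n/n!$ and compares a finite-difference estimate of $P_n'(t)$ with the right-hand side $-\lambda P_n(t) + \lambda P_{n-1}(t)$ used in the proof. The function names and the values $\lambda = 2$, $t = 1.5$ are illustrative choices, not part of the text.

```python
import math


def poisson_process_pmf(n: int, lam: float, t: float) -> float:
    """P_n(t) = e^{-lam*t} (lam*t)^n / n!: probability of n events in time t."""
    return math.exp(-lam * t) * (lam * t) ** n / math.factorial(n)


def derivative_residual(n: int, lam: float, t: float, h: float = 1e-6) -> float:
    """Central-difference estimate of P_n'(t) minus -lam*P_n(t) + lam*P_{n-1}(t).

    For n >= 1 the result should be close to zero if P_n solves the
    differential equation derived in the proof.
    """
    numeric = (poisson_process_pmf(n, lam, t + h)
               - poisson_process_pmf(n, lam, t - h)) / (2 * h)
    rhs = (-lam * poisson_process_pmf(n, lam, t)
           + lam * poisson_process_pmf(n - 1, lam, t))
    return numeric - rhs


if __name__ == "__main__":
    lam, t = 2.0, 1.5  # illustrative values only
    total = sum(poisson_process_pmf(n, lam, t) for n in range(60))
    mean = sum(n * poisson_process_pmf(n, lam, t) for n in range(60))
    print("total probability:", total)
    print("mean count (compare with lam*t):", mean)
    print("largest ODE residual:",
          max(abs(derivative_residual(n, lam, t)) for n in range(1, 10)))
```

The truncation at 60 terms is arbitrary; it is more than enough here because the Poisson tail with mean 3 is negligible that far out.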

The reverse is also true. That is, we could equivalently have defined the Poisson 
process as a process with Poisson arrivals, as follows. 


Theorem 8.8: Let $\{N(t) \mid t \ge 0\}$ be a stochastic process such that:

1. $N(0) = 0$;

2. the process has independent increments (i.e., the number of events in disjoint time
intervals are independent events); and

3. the number of events in an interval of length $t$ has a Poisson distribution with mean $\lambda t$.

Then $\{N(t) \mid t \ge 0\}$ is a Poisson process with rate $\lambda$.


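Proof: The process clearly satisfies conditions 1 and 2 of Definition 8.4. To prove
condition 3, we have

$$\lim_{t \to 0}\frac{\Pr(N(t) = 1)}{t} = \lim_{t \to 0}\frac{e^{-\lambda t}\lambda t}{t} = \lambda.$$

Condition 4 follows from

$$\lim_{t \to 0}\frac{\Pr(N(t) \ge 2)}{t} = \lim_{t \to 0}\sum_{k \ge 2}\frac{e^{-\lambda t}(\lambda t)^k}{k!\,t} = 0.$$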


8.4.1. Interarrival Distribution 

Let $X_1$ be the time of the first event of the Poisson process, and let $X_n$ be the
interval of time between the $(n-1)$th and the $n$th event. The $X_n$ are generally referred to
as interarrival times, since they represent the time between arrivals of events. Here,
we show that all of the $X_n$ have the same distribution and that this distribution is
exponential.

We begin by deriving the distribution of $X_1$.

Theorem 8.9: $X_1$ has an exponential distribution with parameter $\lambda$.

Proof:

$$\Pr(X_1 > t) = \Pr(N(t) = 0) = e^{-\lambda t}.$$

Thus,

$$F(t) = 1 - \Pr(X_1 > t) = 1 - e^{-\lambda t}.$$

■


Using the fact that the Poisson process has independent and stationary increments, we 
can prove the following stronger result. 

Theorem 8.10: The random variables $X_i$, $i = 1, 2, \ldots$, are independent, identically
distributed, exponential random variables with parameter $\lambda$.

Proof: The distribution of $X_i$ is given by

$$\Pr\bigl(X_i > t_i \mid (X_0 = t_0) \cap (X_1 = t_1) \cap \cdots \cap (X_{i-1} = t_{i-1})\bigr)
= \Pr\left(N\left(\sum_{k=0}^{i} t_k\right) - N\left(\sum_{k=0}^{i-1} t_k\right) = 0\right)
= e^{-\lambda t_i}.$$

Thus, the distribution of $X_i$ is exponential with parameter $\lambda$, and it is independent of
other interarrival values. ■


Theorem 8.10 states that, if we have a Poisson arrival process, then the interarrival times
are identically distributed exponential random variables. In fact, it is easy to check that
the reverse is also true (this is left as Exercise 8.17).


Theorem 8.11: Let $\{N(t) \mid t \ge 0\}$ be a stochastic process such that:

1. $N(0) = 0$; and

2. the interarrival times are independent, identically distributed, exponential random
variables with parameter $\lambda$.

Then $\{N(t) \mid t \ge 0\}$ is a Poisson process with rate $\lambda$.
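Theorem 8.11 suggests a direct way to simulate a Poisson process: keep adding independent exponential interarrival times until the clock passes $t$. The sketch below does this with Python's standard `random.expovariate` and estimates the mean and variance of the count, both of which should be near $\lambda t$ for a Poisson count. The helper names, the seed, and the parameter values are assumptions made for illustration.

```python
import random


def count_events(lam: float, t: float, rng: random.Random) -> int:
    """Number of arrivals in [0, t] when interarrival times are i.i.d. Exp(lam)."""
    clock, events = 0.0, 0
    while True:
        clock += rng.expovariate(lam)  # next interarrival time
        if clock > t:
            return events
        events += 1


def estimate_mean_and_variance(lam: float, t: float, trials: int, seed: int = 0):
    """Monte Carlo estimate of the mean and sample variance of N(t)."""
    rng = random.Random(seed)
    counts = [count_events(lam, t, rng) for _ in range(trials)]
    mean = sum(counts) / trials
    variance = sum((c - mean) ** 2 for c in counts) / (trials - 1)
    return mean, variance


if __name__ == "__main__":
    mean, variance = estimate_mean_and_variance(lam=2.0, t=3.0, trials=20000)
    print("estimated mean:", mean, "estimated variance:", variance)
    print("lam * t:", 2.0 * 3.0)
```

Agreement of the first two moments is only a plausibility check, not a proof that the counts are Poisson.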


8.4.2. Combining and Splitting Poisson Processes 

The correspondence between Poisson processes and exponentially distributed interarrival
times is quite useful in proving facts about the behavior of Poisson processes.
One immediate fact is that Poisson processes combine in a natural way. We say that
two Poisson processes $N_1(t)$ and $N_2(t)$ are independent if and only if the values $N_1(t)$
and $N_2(u)$ are independent for any $t$ and $u$. Let $N_1(t) + N_2(t)$ denote the process that
counts the number of events corresponding to both of the processes $N_1(t)$ and $N_2(t)$.
We show that, if $N_1(t)$ and $N_2(t)$ are independent Poisson processes, then they combine
to form a Poisson process $N_1(t) + N_2(t)$.

Theorem 8.12: Let $N_1(t)$ and $N_2(t)$ be independent Poisson processes with parameters
$\lambda_1$ and $\lambda_2$, respectively. Then $N_1(t) + N_2(t)$ is a Poisson process with parameter
$\lambda_1 + \lambda_2$, and each event of the process $N_1(t) + N_2(t)$ arises from the process $N_1(t)$
with probability $\lambda_1/(\lambda_1 + \lambda_2)$.


Proof: Clearly $N_1(0) + N_2(0) = 0$, and since the two processes are independent and
each has independent increments, the sum of the two processes also has independent
increments. The number of arrivals $N_1(t) + N_2(t)$ is a sum of two independent Poisson
random variables, which (as we saw in Lemma 5.2) has a Poisson distribution with
parameter $(\lambda_1 + \lambda_2)t$. Thus, by Theorem 8.8, $N_1(t) + N_2(t)$ is a Poisson process with rate
$\lambda_1 + \lambda_2$.

By Theorem 8.9, the interarrival time for $N_1(t) + N_2(t)$ is exponentially distributed
with parameter $\lambda_1 + \lambda_2$, and by Lemma 8.5 an event in $N_1(t) + N_2(t)$ comes from the
process $N_1(t)$ with probability $\lambda_1/(\lambda_1 + \lambda_2)$. ■


The theorem extends to more than two processes by induction. 
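The combining property can be checked empirically. The following sketch, under the assumption that each process is simulated from exponential interarrival times, merges two independent streams and reports the average gap between merged events (which should be near $1/(\lambda_1 + \lambda_2)$) and the fraction of merged events that came from the first stream (which should be near $\lambda_1/(\lambda_1 + \lambda_2)$). All names and parameter values are illustrative.

```python
import random


def arrival_times(lam: float, t: float, rng: random.Random) -> list:
    """Arrival times in [0, t] of a rate-lam process built from exponential gaps."""
    times, clock = [], rng.expovariate(lam)
    while clock <= t:
        times.append(clock)
        clock += rng.expovariate(lam)
    return times


def merge_statistics(lam1: float, lam2: float, t: float, seed: int = 0):
    """Return (mean gap of the merged stream, fraction of events from stream 1)."""
    rng = random.Random(seed)
    first = [(s, 1) for s in arrival_times(lam1, t, rng)]
    second = [(s, 2) for s in arrival_times(lam2, t, rng)]
    merged = sorted(first + second)  # sort by arrival time
    gaps = [b[0] - a[0] for a, b in zip(merged, merged[1:])]
    mean_gap = sum(gaps) / len(gaps)
    fraction_from_first = len(first) / len(merged)
    return mean_gap, fraction_from_first


if __name__ == "__main__":
    mean_gap, fraction = merge_statistics(lam1=1.0, lam2=3.0, t=5000.0)
    print("mean gap:", mean_gap, "expected about", 1 / (1.0 + 3.0))
    print("fraction from process 1:", fraction, "expected about", 1.0 / (1.0 + 3.0))
```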

It is interesting to note that Poisson processes can be split as well as combined. If
we split a Poisson process with rate $\lambda$ by labeling each event as being either type 1
with probability $p$ or type 2 with probability $(1 - p)$, then it seems that we should get
two Poisson processes with rates $\lambda p$ and $\lambda(1 - p)$. In fact, we can say something even
stronger: the two processes will be independent.


Theorem 8.13: Suppose that we have a Poisson process $N(t)$ with rate $\lambda$. Each event
is independently labeled as being type 1 with probability $p$ or type 2 with probability
$1 - p$. Then the type-1 events form a Poisson process $N_1(t)$ of rate $\lambda p$, the type-2
events form a Poisson process $N_2(t)$ of rate $\lambda(1 - p)$, and the two Poisson processes
are independent.


Proof: We first show that the type-1 events in fact form a Poisson process. Clearly
$N_1(0) = 0$, and since the process $N(t)$ has independent increments, so does the process
$N_1(t)$. Next we show that $N_1(t)$ has a Poisson distribution:

$$\begin{aligned}
\Pr(N_1(t) = k) &= \sum_{j=k}^{\infty}\Pr(N_1(t) = k \mid N(t) = j)\Pr(N(t) = j) \\
&= \sum_{j=k}^{\infty}\binom{j}{k}p^k(1-p)^{j-k}\,\frac{e^{-\lambda t}(\lambda t)^j}{j!} \\
&= \frac{e^{-\lambda pt}(\lambda pt)^k}{k!}\sum_{j=k}^{\infty}\frac{e^{-\lambda t(1-p)}(\lambda t(1-p))^{j-k}}{(j-k)!} \\
&= \frac{e^{-\lambda pt}(\lambda pt)^k}{k!}.
\end{aligned}$$

Thus, by Theorem 8.8, $N_1(t)$ is a Poisson process with rate $\lambda p$.


To show independence, we need to show that $N_1(t)$ and $N_2(u)$ are independent for
any $t$ and $u$. In fact, it suffices to show that $N_1(t)$ and $N_2(t)$ are independent for any $t$;
we can then show that $N_1(t)$ and $N_2(u)$ are independent for any $t$ and $u$ by taking
advantage of the fact that Poisson processes have independent and stationary increments
(see Exercise 8.18). We have:

$$\begin{aligned}
\Pr((N_1(t) = m) \cap (N_2(t) = n)) &= \Pr((N(t) = m + n) \cap (N_2(t) = n)) \\
&= \frac{e^{-\lambda t}(\lambda t)^{m+n}}{(m+n)!}\binom{m+n}{n}p^m(1-p)^n \\
&= \frac{e^{-\lambda t}(\lambda t)^{m+n}}{m!\,n!}\,p^m(1-p)^n \\
&= \frac{e^{-\lambda tp}(\lambda tp)^m}{m!}\cdot\frac{e^{-\lambda t(1-p)}(\lambda t(1-p))^n}{n!} \\
&= \Pr(N_1(t) = m)\Pr(N_2(t) = n). \qquad ■
\end{aligned}$$
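The factorization at the heart of the independence argument can be verified numerically. The sketch below computes the joint probability of $m$ type-1 and $n$ type-2 events as "Poisson total times binomial labeling" and compares it with the product of the two Poisson marginals with means $\lambda tp$ and $\lambda t(1-p)$. The function names and the test parameters are illustrative assumptions.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability that a Poisson random variable with the given mean equals k."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def joint_split_probability(m: int, n: int, lam: float, t: float, p: float) -> float:
    """Pr(N1(t)=m, N2(t)=n): m+n events in total, exactly m of them labeled type 1."""
    return (poisson_pmf(m + n, lam * t)
            * math.comb(m + n, m) * p ** m * (1 - p) ** n)


def product_of_marginals(m: int, n: int, lam: float, t: float, p: float) -> float:
    """Pr(N1(t)=m) * Pr(N2(t)=n) with rates lam*p and lam*(1-p)."""
    return poisson_pmf(m, lam * t * p) * poisson_pmf(n, lam * t * (1 - p))


if __name__ == "__main__":
    lam, t, p = 2.5, 1.2, 0.3  # illustrative values only
    worst = max(
        abs(joint_split_probability(m, n, lam, t, p)
            - product_of_marginals(m, n, lam, t, p))
        for m in range(8) for n in range(8)
    )
    print("largest discrepancy:", worst)
```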


8.4.3. Conditional Arrival Time Distribution 


We have used the fact that a Poisson process has independent increments to show that
the distribution of the interarrival times is exponential. Another application of this
assumption is the following: If we condition on exactly one event occurring in an interval,
then the actual time at which that event occurs is uniformly distributed over that
interval. To see this, consider a Poisson process where $N(t) = 1$, and consider the time $X_1$
of the single event that falls in the interval $(0, t]$:

$$\begin{aligned}
\Pr(X_1 < s \mid N(t) = 1) &= \frac{\Pr((X_1 < s) \cap (N(t) = 1))}{\Pr(N(t) = 1)} \\
&= \frac{\Pr((N(s) = 1) \cap (N(t) - N(s) = 0))}{\Pr(N(t) = 1)} \\
&= \frac{(\lambda s e^{-\lambda s})\,e^{-\lambda(t-s)}}{\lambda t e^{-\lambda t}} \\
&= \frac{s}{t}.
\end{aligned}$$

Here we have used the independence of $N(s)$ and $N(t) - N(s)$.
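The uniformity of the single arrival time can be illustrated by rejection sampling: simulate the first two interarrival times, keep the run only if exactly one event lands in $(0, t]$, and record where it landed. The sketch below estimates $\Pr(X_1 < s \mid N(t) = 1)$, which should be close to $s/t$. The names, the seed, and the parameter values are illustrative assumptions.

```python
import random


def first_arrival_given_one_event(lam: float, t: float, rng: random.Random) -> float:
    """Sample the arrival time conditioned on exactly one event in (0, t]."""
    while True:
        first = rng.expovariate(lam)
        if first > t:
            continue  # no event in (0, t]; reject
        second = first + rng.expovariate(lam)
        if second > t:
            return first  # exactly one event in (0, t]; accept
        # two or more events in (0, t]; reject and retry


def empirical_cdf_at(s: float, lam: float, t: float, trials: int, seed: int = 0) -> float:
    """Estimate Pr(X_1 < s | N(t) = 1) from accepted samples."""
    rng = random.Random(seed)
    hits = sum(first_arrival_given_one_event(lam, t, rng) < s for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    s, lam, t = 0.5, 1.5, 2.0  # illustrative values only
    print("estimate:", empirical_cdf_at(s, lam, t, trials=20000), "s/t:", s / t)
```

Note that the estimate does not depend on $\lambda$, in line with the calculation above where $\lambda$ cancels.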

To generalize this to the case of $N(t) = n$, we use the concept of order statistics. Let
$X_1, \ldots, X_n$ be $n$ independent observations of a random variable. The order statistics of
$X_1, \ldots, X_n$ consists of the $n$ observations in (increasing) sorted order. For example, if
$X_1, X_2, X_3, X_4$ are independent random variables generated by taking a number chosen
uniformly on $[0, 1]$ and rounding to two decimal places, we might have $X_1 = 0.47$,
$X_2 = 0.33$, $X_3 = 0.93$, and $X_4 = 0.26$. The corresponding order statistics, where $Y_{(i)}$
is used to refer to the $i$th smallest, would be $Y_{(1)} = 0.26$, $Y_{(2)} = 0.33$, $Y_{(3)} = 0.47$, and
$Y_{(4)} = 0.93$.


Theorem 8.14: Given that $N(t) = n$, the $n$ arrival times have the same distribution as
the order statistics of $n$ independent random variables with uniform distribution over
$[0, t]$.
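The statement of Theorem 8.14 can be probed by simulation. The sketch below generates arrival times in two ways: by rejection sampling from exponential interarrival times until exactly $n$ events fall in $[0, t]$, and by sorting $n$ independent uniform draws on $[0, t]$. It then compares the sample means of the $k$th arrival under both schemes; for sorted uniforms this mean is $kt/(n+1)$. The function names, seed, and parameters are assumptions for illustration, and matching means is only a partial check of equality in distribution.

```python
import random


def arrivals_by_interarrival(lam: float, t: float, rng: random.Random) -> list:
    """Arrival times in [0, t] built from i.i.d. Exp(lam) interarrival times."""
    times, clock = [], rng.expovariate(lam)
    while clock <= t:
        times.append(clock)
        clock += rng.expovariate(lam)
    return times


def arrivals_given_count(lam: float, t: float, n: int, rng: random.Random) -> list:
    """Rejection sampling: repeat until exactly n events fall in [0, t]."""
    while True:
        times = arrivals_by_interarrival(lam, t, rng)
        if len(times) == n:
            return times


def sorted_uniforms(t: float, n: int, rng: random.Random) -> list:
    """Order statistics of n independent uniform draws on [0, t]."""
    return sorted(rng.uniform(0.0, t) for _ in range(n))


def compare_means(lam: float = 1.5, t: float = 2.0, n: int = 3,
                  trials: int = 5000, seed: int = 0) -> list:
    """For each k, return (mean of k-th conditioned arrival, mean of k-th sorted uniform)."""
    rng = random.Random(seed)
    conditioned = [arrivals_given_count(lam, t, n, rng) for _ in range(trials)]
    uniform = [sorted_uniforms(t, n, rng) for _ in range(trials)]
    return [
        (sum(c[k] for c in conditioned) / trials, sum(u[k] for u in uniform) / trials)
        for k in range(n)
    ]


if __name__ == "__main__":
    for k, (a, b) in enumerate(compare_means(), start=1):
        print(f"arrival {k}: conditioned mean {a:.3f}, sorted-uniform mean {b:.3f}")
```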


Proof: We first compute the distribution of the order statistics of $n$ independent
observations $X_1, X_2, \ldots, X_n$ drawn from a uniform distribution in $[0, t]$. Let $Y_{(1)}, \ldots, Y_{(n)}$
denote the order statistics.

We want an expression for

$$\Pr(Y_{(1)} \le s_1,\; Y_{(2)} \le s_2,\; \ldots,\; Y_{(n)} \le s_n).$$

Let $E$ be the event that

$$Y_{(1)} \le s_1,\; Y_{(2)} \le s_2,\; \ldots,\; Y_{(n)} \le s_n.$$

For any permutation $i_1, i_2, \ldots, i_n$ of the numbers from 1 to $n$, let $E_{i_1, i_2, \ldots, i_n}$ be the event
that

$$X_{i_1} \le s_1,\; X_{i_1} \le X_{i_2} \le s_2,\; \ldots,\; X_{i_{n-1}} \le X_{i_n} \le s_n.$$

The events $E_{i_1, i_2, \ldots, i_n}$ are disjoint, except for the cases where $X_{i_j} = X_{i_{j+1}}$ for some $j$.
Since two uniform random variables are equal with probability 0, the total probability
of such events is 0 and can be ignored. By symmetry, all events $E_{i_1, i_2, \ldots, i_n}$ have the
same probability. Also,

$$E = \bigcup E_{i_1, i_2, \ldots, i_n},$$

where the union is over all permutations. It follows that

$$\begin{aligned}
\Pr(Y_{(1)} \le s_1,\; Y_{(2)} \le s_2,\; \ldots,\; Y_{(n)} \le s_n)
&= \sum \Pr(X_{i_1} \le s_1,\; X_{i_1} \le X_{i_2} \le s_2,\; \ldots,\; X_{i_{n-1}} \le X_{i_n} \le s_n) \\
&= n!\,\Pr(X_1 \le s_1,\; X_1 \le X_2 \le s_2,\; \ldots,\; X_{n-1} \le X_n \le s_n),
\end{aligned}$$

where the sum in the second line is over all $n!$ permutations. If we now think of $u_i$ as
representing the value taken on by $X_i$, then

$$\Pr(X_1 \le s_1,\; X_1 \le X_2 \le s_2,\; \ldots,\; X_{n-1} \le X_n \le s_n)
= \int_{u_1=0}^{s_1}\int_{u_2=u_1}^{s_2}\cdots\int_{u_n=u_{n-1}}^{s_n}\left(\frac{1}{t}\right)^n du_n \cdots du_1,$$

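where we use the fact that the density function of a uniform random variable on $[0, t]$
is $f(x) = 1/t$. This gives

$$\Pr(Y_{(1)} \le s_1,\; Y_{(2)} \le s_2,\; \ldots,\; Y_{(n)} \le s_n)
= \frac{n!}{t^n}\int_{u_1=0}^{s_1}\int_{u_2=u_1}^{s_2}\cdots\int_{u_n=u_{n-1}}^{s_n} du_n \cdots du_1.$$

We now consider the distribution of the arrival times for a Poisson process, conditioned
on $N(t) = n$. Let $S_1, \ldots, S_{n+1}$ be the first $n + 1$ arrival times. Also, let $T_1 = S_1$
and $T_i = S_i - S_{i-1}$ be the length of the interarrival intervals. By Theorem 8.10, we
know that (a) without the condition $N(t) = n$, the distributions of the random variables
$T_1, \ldots, T_{n+1}$ are independent, and (b) for each $i$, $T_i$ has an exponential distribution
with parameter $\lambda$. Recalling that the density function of the exponential distribution is
$\lambda e^{-\lambda t}$, we have

$$\begin{aligned}
&\Pr(S_1 \le s_1,\; S_2 \le s_2,\; \ldots,\; S_n \le s_n,\; N(t) = n) \\
&\quad= \Pr\left(T_1 \le s_1,\; T_2 \le s_2 - T_1,\; \ldots,\; T_n \le s_n - \sum_{i=1}^{n-1} T_i,\; T_{n+1} > t - \sum_{i=1}^{n} T_i\right) \\
&\quad= \int_{t_1=0}^{s_1}\int_{t_2=0}^{s_2 - t_1}\cdots\int_{t_n=0}^{s_n - \sum_{i=1}^{n-1} t_i}\int_{t_{n+1} = t - \sum_{i=1}^{n} t_i}^{\infty}
\lambda^{n+1} e^{-\lambda\sum_{i=1}^{n+1} t_i}\, dt_{n+1} \cdots dt_1.
\end{aligned}$$

Integrating with respect to $t_{n+1}$ then yields

$$\int_{t_{n+1} = t - \sum_{i=1}^{n} t_i}^{\infty}\lambda^{n+1} e^{-\lambda\sum_{i=1}^{n+1} t_i}\, dt_{n+1}
= -\lambda^n\left[e^{-\lambda\sum_{i=1}^{n+1} t_i}\right]_{t_{n+1} = t - \sum_{i=1}^{n} t_i}^{\infty}
= \lambda^n e^{-\lambda t}.$$

Thus,

$$\begin{aligned}
\Pr(S_1 \le s_1,\; S_2 \le s_2,\; \ldots,\; S_n \le s_n,\; N(t) = n)
&= \lambda^n e^{-\lambda t}\int_{t_1=0}^{s_1}\int_{t_2=0}^{s_2 - t_1}\cdots\int_{t_n=0}^{s_n - \sum_{i=1}^{n-1} t_i} dt_n \cdots dt_1 \\
&= \lambda^n e^{-\lambda t}\int_{u_1=0}^{s_1}\int_{u_2=u_1}^{s_2}\cdots\int_{u_n=u_{n-1}}^{s_n} du_n \cdots du_1,
\end{aligned}$$

where the last equation is obtained by substituting $u_i = \sum_{j=1}^{i} t_j$.

Since the number of events in an interval of length $t$ has a Poisson distribution
with parameter $\lambda t$, we have

$$\Pr(N(t) = n) = e^{-\lambda t}\frac{(\lambda t)^n}{n!},$$

and the conditional probability computation gives

$$\begin{aligned}
\Pr(S_1 \le s_1,\; S_2 \le s_2,\; \ldots,\; S_n \le s_n \mid N(t) = n)
&= \frac{\Pr(S_1 \le s_1,\; S_2 \le s_2,\; \ldots,\; S_n \le s_n,\; N(t) = n)}{\Pr(N(t) = n)} \\
&= \frac{n!}{t^n}\int_{u_1=0}^{s_1}\int_{u_2=u_1}^{s_2}\cdots\int_{u_n=u_{n-1}}^{s_n} du_n \cdots du_1.
\end{aligned}$$

This is exactly the distribution function of the order statistics, proving the theorem. ■


8.5. Continuous Time Markov Processes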


In Chapter 7 we studied discrete time and discrete space Markov chains. With the 
introduction of continuous random variables, we can now study the continuous time 
analogue of Markov chains, where the process spends a random interval of time in a 
state before moving to the next one. To distinguish between the discrete and continu¬ 
ous processes, when dealing with continuous time we speak of Markov processes. 

Definition 8.5: A continuous time random process $\{X_t \mid t \ge 0\}$ is Markovian (or is
called a Markov process) if, for all $s, t \ge 0$:

$$\Pr(X(s+t) = x \mid X(u),\ 0 \le u \le t) = \Pr(X(s+t) = x \mid X(t)),$$

and this probability is independent of the time $t$.²

The definition says that the distribution of the state $X(s+t)$ of the system at time $s + t$,
conditioned on the history up to time $t$, depends only on the state $X(t)$ and is independent of
the particular history that led the process to state $X(t)$.

Restricting our discussion to discrete space, continuous time Markov processes,
there is another equivalent way of formulating such processes that is more convenient
for analysis. Recall that a discrete time Markov chain is determined by a transition
matrix $P = (P_{i,j})$, where $P_{i,j}$ is the probability of a transition from state $i$ to state $j$ in
one step. A continuous time Markov process can be expressed as a combination of two
random processes as follows.

1. A transition matrix $P = (p_{i,j})$, where $p_{i,j}$ is the probability that the next state
is $j$ given that the current state is $i$. (We use lowercase letters here for the transition
probabilities in order to distinguish them from the transition probabilities for
corresponding discrete time processes.) The matrix $P$ is the transition matrix for
what is called the embedded or skeleton Markov chain of the corresponding Markov
process.

2. A vector of parameters $(\theta_1, \theta_2, \ldots)$ such that the distribution of time that the process
spends in state $i$ before moving to the next step is exponential with parameter $\theta_i$.
The distribution of time spent at a given state must be exponential in order to satisfy
the memoryless requirement of the Markov process.


A formal treatment of continuous time Markov processes is more involved than their
discrete counterparts, and a full discussion is beyond the scope of this book. We limit
our discussion to the question of computing the stationary distribution (also called
equilibrium distribution) for discrete space, continuous time processes, assuming that
a stationary distribution exists. As for the discrete time case, the value $\pi_i$ in a stationary
distribution $\bar{\pi}$ gives the limiting probability that the Markov process will be in state $i$
infinitely far out in the future, regardless of the initial state. That is, if we let $P_{j,i}(t)$ be
the probability of being in state $i$ at time $t$ when starting from state $j$ at time 0, then

$$\lim_{t \to \infty} P_{j,i}(t) = \pi_i.$$

Similarly, $\pi_i$ gives the long-term proportion of the time the process is in state $i$.
Furthermore, if the initial state $j$ is chosen from the stationary distribution, then the
probability of being in state $i$ at time $t$ is $\pi_i$ for all $t$.

² Technically, as with the discrete time Markov chains, this is a time-homogeneous Markov process; this will be
the only type we study in this book.

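To determine the stationary distribution, consider the derivative $P_{j,i}'(t)$:

$$\begin{aligned}
P_{j,i}'(t) &= \lim_{h \to 0}\frac{P_{j,i}(t+h) - P_{j,i}(t)}{h} \\
&= \lim_{h \to 0}\frac{\sum_k P_{j,k}(t)P_{k,i}(h) - P_{j,i}(t)}{h} \\
&= \lim_{h \to 0}\left(\sum_{k \ne i}\frac{P_{k,i}(h)}{h}P_{j,k}(t) - \frac{1 - P_{i,i}(h)}{h}P_{j,i}(t)\right).
\end{aligned}$$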


Since the distribution of time spent at state $k$ is exponential with parameter $\theta_k$, we
can use the properties of the Poisson process to observe that, as $h$ tends to zero, the
limiting probability of a transition out of state $k$ in an interval of length $h$ is $h\theta_k$, and
the limiting probability of more than one transition is 0. Thus,

$$\lim_{h \to 0}\frac{P_{k,i}(h)}{h} = \theta_k p_{k,i}.$$

Similarly, $1 - P_{i,i}(h)$ is the probability that a transition occurs over the interval of time
$h$, and the transition is not from state $i$ back to itself. Thus,

$$\lim_{h \to 0}\frac{1 - P_{i,i}(h)}{h} = \theta_i(1 - p_{i,i}).$$


We now assume that we can interchange the limit and the summation; we emphasize
that this interchange is not always justified for countably infinite spaces. Subject
to this assumption,

$$\begin{aligned}
P_{j,i}'(t) &= \lim_{h \to 0}\left(\sum_{k \ne i}\frac{P_{k,i}(h)}{h}P_{j,k}(t) - \frac{1 - P_{i,i}(h)}{h}P_{j,i}(t)\right) \\
&= \sum_{k \ne i}\theta_k p_{k,i}P_{j,k}(t) - P_{j,i}(t)(\theta_i - \theta_i p_{i,i}) \\
&= \sum_{k}\theta_k p_{k,i}P_{j,k}(t) - \theta_i P_{j,i}(t).
\end{aligned}$$

Taking the limit as $t \to \infty$, we have

$$\lim_{t \to \infty} P_{j,i}'(t) = \lim_{t \to \infty}\left(\sum_{k}\theta_k p_{k,i}P_{j,k}(t) - \theta_i P_{j,i}(t)\right)
= \sum_{k}\theta_k p_{k,i}\pi_k - \theta_i\pi_i.$$

If the process has a stationary distribution, it must be that

$$\lim_{t \to \infty} P_{j,i}'(t) = 0.$$

Otherwise, $P_{j,i}(t)$ would not converge to a stationary value. Hence, in the stationary
distribution $\bar{\pi}$ we have the following rate equations:

$$\pi_i\theta_i = \sum_{k}\pi_k\theta_k p_{k,i}. \tag{8.4}$$

This set of equations has a nice interpretation. The expression on the left, $\pi_i\theta_i$, is the
rate at which transitions occur out of state $i$. The expression on the right, $\sum_k \pi_k\theta_k p_{k,i}$,
is the rate at which transitions occur into state $i$. (A transition that goes from state $i$
back to state $i$ is counted both as a transition into and as a transition out of state $i$.)
At the stationary distribution, these rates must be equal, so that the long-term rates of
transitions into and out of the state are equal. This equalization of rates into and out
of every state provides a simple, intuitive way to find stationary distributions for
continuous Markovian processes. This observation can be generalized to sets of states,
showing that a result similar to the cut-set equations of Theorem 7.9 for discrete time
Markov chains can be formulated for continuous time Markov processes.

If the exponential distributions governing the time spent in all of the states have the
same parameter, so that all the $\theta_i$ are equal, then Eqn. (8.4) becomes

$$\pi_i = \sum_{k}\pi_k p_{k,i}.$$

This corresponds to

$$\bar{\pi} = \bar{\pi}P,$$

where $P$ is the transition matrix of the embedded Markov chain. We can conclude that
the stationary distribution of the continuous time process is the same as the stationary
distribution of the embedded Markov chain in this case.
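For a finite state space, the rate equations (8.4) can be solved numerically. The sketch below is one possible approach, not a method taken from the text: substituting $y_i = \pi_i\theta_i$ turns (8.4) into $y = yP$, so $y$ is stationary for the embedded chain, and then $\pi_i$ is proportional to $y_i/\theta_i$. To sidestep periodicity, the code iterates the lazy update $y \leftarrow (y + yP)/2$, which has the same fixed point. It assumes the embedded chain is finite and irreducible; the function name and the three-state example are hypothetical.

```python
def stationary_distribution(P, theta, sweeps=2000):
    """Solve pi_i*theta_i = sum_k pi_k*theta_k*P[k][i] on a finite state space.

    P is the embedded-chain transition matrix (list of rows) and theta[i] is
    the exponential holding-time parameter of state i. Assumes the embedded
    chain is irreducible.
    """
    n = len(P)
    y = [1.0 / n] * n
    for _ in range(sweeps):
        # One step of the lazy chain: same stationary vector as P itself.
        yP = [sum(y[k] * P[k][i] for k in range(n)) for i in range(n)]
        y = [(a + b) / 2 for a, b in zip(y, yP)]
    weights = [y[i] / theta[i] for i in range(n)]  # pi_i proportional to y_i/theta_i
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical three-state example: a path 0 - 1 - 2 with different holding rates.
    P = [[0.0, 1.0, 0.0],
         [0.5, 0.0, 0.5],
         [0.0, 1.0, 0.0]]
    theta = [1.0, 2.0, 4.0]
    pi = stationary_distribution(P, theta)
    print(pi)
    # Check the rate equations: rate out of each state equals rate in.
    for i in range(3):
        rate_in = sum(pi[k] * theta[k] * P[k][i] for k in range(3))
        print(i, pi[i] * theta[i], rate_in)
```

When all the $\theta_i$ are equal, the division by `theta[i]` has no effect after normalization, which matches the observation above that the two stationary distributions coincide in that case.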


8.6. Example: Markovian Queues 


Queues appear in many basic applications in computer science. In operating systems, 
schedulers can hold tasks in a queue until the processor or other required resources are 
available. In parallel or distributed programming, threads can queue for a critical sec¬ 
tion that allows access to only one thread at a time. In networks, packets are queued 
while waiting to be forwarded by a router. Even before computer systems were preva¬ 
lent, queues were widely studied to understand the performance of telephone networks, 
where similar scheduling issues arise. In this section we analyze some of the most basic 
queueing models, which use Poisson processes to model the stochastic process of cus¬ 
tomers arriving at a queue and exponentially distributed random variables to model the 
time required for service. 

In what follows, we refer to queue models using the standard notation Y/Z/n, where
Y represents the distribution of the incoming stream of customers, Z represents the
service time distribution, and n represents the number of servers. The standard notation
for a Markovian or memoryless distribution is M. Thus, M/M/n stands for a queue
model with customers arriving according to a Poisson process and served by n servers
having identical and independent exponentially distributed service times. Other queueing
models include the M/M/∞ model, where there are an infinite number of servers,
and the M/G/1 model, where the G indicates that the service time can be any arbitrary
general distribution.

A queue must also have a rule for determining the order in which customers are 
served. Unless otherwise specified, we assume that a queue follows the First In First 
Out (FIFO) rule, where customers are served in order of their arrival. 

8.6.1. M/M/1 Queue in Equilibrium 

Assume that customers arrive to a queue according to a Poisson process with parameter
$\lambda$, and assume they are served by one server. The service times for the customers
are independent and exponentially distributed with parameter $\mu$.

Let $M(t)$ be the number of customers in the queue at time $t$. Since both the arrival
process and the service time have memoryless distributions, the process $\{M(t) \mid t \ge 0\}$
defines a continuous time Markov process. We consider the stationary distribution for
this process.

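Let

$$P_k(t) = \Pr(M(t) = k)$$

denote the probability that the queue has $k$ customers at time $t$. We use the fact that,
in the limit as $h$ approaches 0, the probability of an arrival (respectively, a departure)
over a time interval of length $h$ is $\lambda h$ (respectively, $\mu h$). Thus,

$$\begin{aligned}
\frac{dP_0(t)}{dt} &= \lim_{h \to 0}\frac{P_0(t+h) - P_0(t)}{h} \\
&= \lim_{h \to 0}\frac{P_0(t)(1 - \lambda h) + P_1(t)\mu h - P_0(t)}{h} \\
&= -\lambda P_0(t) + \mu P_1(t),
\end{aligned} \tag{8.5}$$

and for $k \ge 1$,

$$\begin{aligned}
\frac{dP_k(t)}{dt} &= \lim_{h \to 0}\frac{P_k(t+h) - P_k(t)}{h} \\
&= \lim_{h \to 0}\frac{P_k(t)(1 - \lambda h - \mu h) + P_{k-1}(t)\lambda h + P_{k+1}(t)\mu h - P_k(t)}{h} \\
&= -(\lambda + \mu)P_k(t) + \lambda P_{k-1}(t) + \mu P_{k+1}(t).
\end{aligned} \tag{8.6}$$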

In equilibrium,

$$\frac{dP_k(t)}{dt} = 0 \quad \text{for } k = 0, 1, 2, \ldots.$$

If the system converges to a stationary distribution³ $\bar{\pi}$, then applying Eqn. (8.5)
yields

$$\mu\pi_1 = \lambda\pi_0.$$

This equation has a simple interpretation in terms of rates. In equilibrium, the rate into
the state where there are no customers in the queue is $\mu\pi_1$; the rate out is $\lambda\pi_0$. These
two rates must be equal. If we write this as $\pi_1 = \pi_0(\lambda/\mu)$, then (8.6) and a simple
induction give

$$\pi_k = \pi_{k-1}\left(\frac{\lambda}{\mu}\right) = \pi_0\left(\frac{\lambda}{\mu}\right)^k.$$

Since $\sum_{k \ge 0}\pi_k = 1$, we must have

$$\pi_0\sum_{k \ge 0}\left(\frac{\lambda}{\mu}\right)^k = 1. \tag{8.7}$$

Assuming that $\lambda < \mu$, it follows that

$$\pi_0 = 1 - \frac{\lambda}{\mu} \quad \text{and} \quad \pi_k = \left(1 - \frac{\lambda}{\mu}\right)\left(\frac{\lambda}{\mu}\right)^k.$$

If $\lambda > \mu$, then the summation in Eqn. (8.7) does not converge and, in fact, the system
does not reach a stationary distribution. This is intuitively clear; if the rate of arrival
of new customers is larger than the rate of service completions, then the system cannot
reach a stationary distribution. If $\lambda = \mu$, the system also cannot reach an equilibrium
distribution, as discussed in Exercise 8.22.

³ Again, the proof that the system indeed converges relies on renewal theory and is beyond the scope of this book.

To compute the expected number of customers in the system in equilibrium, which
we denote by $L$, we write

$$L = \sum_{k=0}^{\infty} k\pi_k
= \frac{\lambda}{\mu}\sum_{k=1}^{\infty} k\left(1 - \frac{\lambda}{\mu}\right)\left(\frac{\lambda}{\mu}\right)^{k-1}
= \frac{\lambda}{\mu}\cdot\frac{1}{1 - \lambda/\mu}
= \frac{\lambda}{\mu - \lambda},$$

where in the third equation we used the fact that the sum is the expectation of a
geometric random variable with parameter $1 - \lambda/\mu$.

It is interesting that we have nowhere used the fact that the service rule was to serve
the customer that had been waiting the longest. Indeed, since all service times are
exponentially distributed and since the exponential distribution is memoryless, all
customers appear equivalent to the queue in terms of the distribution of the service time
required until they leave, regardless of how long they have already been served. Thus,
our equations for the equilibrium distribution and the expected number of customers in
the system hold for any service rule that serves some customer whenever at least one
customer is in the queue.

Next we compute the expected time a customer spends in the system when the system
is in equilibrium, denoted by $W$, assuming a FIFO queue. Let $L(k)$ denote the
event that a new customer finds $k$ customers in the queue. We can write

$$W = \sum_{k=0}^{\infty} \mathbf{E}[W \mid L(k)]\Pr(L(k)).$$

Since the service times are independent, memoryless, and have expectation $1/\mu$, it
follows that

$$\mathbf{E}[W \mid L(k)] = (k + 1)\frac{1}{\mu}.$$

To compute $\Pr(L(k))$, we observe that if the system is in equilibrium then the rate
of transitions out of state $k$ is $\pi_k\theta_k$, where $\theta_0 = \lambda$ and $\theta_k = \lambda + \mu$ for $k \ge 1$. Applying
Lemma 8.5, the probability that the next transition from state $k$ is caused by the arrival
of a new customer is $\lambda/\theta_k$. Therefore, the rate at which customers arrive and find $k$
customers already in the queue is

$$\pi_k\theta_k\frac{\lambda}{\theta_k} = \pi_k\lambda.$$

Since the total rate of new arrivals to the system is $\lambda$, we conclude that the probability
that a new arrival finds $k$ customers in the system is

$$\Pr(L(k)) = \frac{\pi_k\lambda}{\lambda} = \pi_k.$$

This is an example of the PASTA principle, which states that Poisson Arrivals See
Time Averages. That is, if a Markov process with Poisson arrivals has a stationary
distribution and if the fraction of time the system is in state $k$ is $\pi_k$, then $\pi_k$ is also the
proportion of arrivals that find the system in state $k$ when they arrive. The PASTA
principle, which is due to the independence and memoryless properties of the Poisson
process, is a useful tool that often simplifies analysis. A proof of the PASTA principle
for more general situations is beyond the scope of this book.

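We can now compute

$$\begin{aligned}
W &= \sum_{k=0}^{\infty} \mathbf{E}[W \mid L(k)]\Pr(L(k)) \\
&= \sum_{k=0}^{\infty}\frac{k + 1}{\mu}\pi_k \\
&= \frac{1}{\mu}(1 + L) \\
&= \frac{1}{\mu}\left(1 + \frac{\lambda}{\mu - \lambda}\right) \\
&= \frac{1}{\mu - \lambda} \\
&= \frac{L}{\lambda}.
\end{aligned}$$

The relationship $L = \lambda W$ is known as Little's result, and it holds not only for M/M/1
queues but for any stable queueing system. The proof of this fundamental result is
beyond the scope of this book.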
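The closed forms for the M/M/1 queue are easy to tabulate. The following sketch computes $\pi_k = (1 - \lambda/\mu)(\lambda/\mu)^k$, $L = \lambda/(\mu - \lambda)$, and $W = 1/(\mu - \lambda)$, and checks $L = \lambda W$ together with a truncated version of the sum $\sum_k k\pi_k$. The function names and the rates $\lambda = 3$, $\mu = 5$ are illustrative assumptions.

```python
def mm1_pi(k: int, lam: float, mu: float) -> float:
    """Equilibrium probability of k customers in an M/M/1 queue (requires lam < mu)."""
    rho = lam / mu
    return (1 - rho) * rho ** k


def mm1_metrics(lam: float, mu: float):
    """Return (L, W): expected number in system and expected time in system."""
    if not 0 < lam < mu:
        raise ValueError("a stationary distribution requires 0 < lam < mu")
    L = lam / (mu - lam)
    W = 1.0 / (mu - lam)
    return L, W


if __name__ == "__main__":
    lam, mu = 3.0, 5.0  # illustrative rates only
    L, W = mm1_metrics(lam, mu)
    truncated_sum = sum(k * mm1_pi(k, lam, mu) for k in range(2000))
    print("L:", L, "W:", W)
    print("lam * W:", lam * W)
    print("truncated sum of k*pi_k:", truncated_sum)
```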

Although the M/M/1 queue represents a very simple process, it can be useful for
studying more complicated processes. For example, suppose that we have several types
of customers entering a queue, with each type arriving according to a Poisson process,
and that all customers have exponentially distributed service times of mean $\mu$. Since
Poisson processes combine, the arrival process to the queue is Poisson, and this can be
modeled as an M/M/1 queue. Similarly, suppose that we have a single Poisson arrival
process, and we establish a separate queue for each type of customer. If each arriving
customer is of type $i$ with some fixed probability $p_i$, then the Poisson process splits
into independent Poisson processes for each type of customer, and hence the queue for
each type is an M/M/1 queue. This type of splitting might occur, for example, if we
use separate processors for different types of jobs in a computer network.


8.6.2. M/M/1/K Queue in Equilibrium


An M/M/l/K queue is an M/M/1 queue with bounded queue size. If a customer ar¬ 
rives while the queue already has K customers, then this customer leaves the system 
instead of joining the queue. Models with bounded queue size are useful for applica¬ 
tions such as network routers, where packets that arrive once the packet buffer is full 
must be dropped. 

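The system is entirely similar to the previous example. In equilibrium we have

$$\pi_k = \begin{cases} \pi_0(\lambda/\mu)^k & \text{for } k \le K, \\ 0 & \text{for } k > K, \end{cases}$$

and

$$\pi_0 = \frac{1}{\sum_{k=0}^{K}(\lambda/\mu)^k}.$$

These equations define a proper probability distribution for any $\lambda, \mu > 0$, and we no
longer require that $\lambda < \mu$.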
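Because the sum is finite, the M/M/1/K distribution can be computed by direct normalization, even when $\lambda > \mu$. The sketch below does so and checks the balance relation $\pi_k\lambda = \pi_{k+1}\mu$ between neighboring states. The function name and the example rates are illustrative assumptions.

```python
def mm1k_distribution(lam: float, mu: float, K: int) -> list:
    """Equilibrium distribution (pi_0, ..., pi_K) of an M/M/1/K queue."""
    rho = lam / mu
    weights = [rho ** k for k in range(K + 1)]  # pi_k is proportional to rho^k
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    lam, mu, K = 2.0, 1.0, 3  # arrivals faster than service, still well-defined
    pi = mm1k_distribution(lam, mu, K)
    print(pi)
    for k in range(K):
        # Rate of moving from k up to k+1 equals rate of moving from k+1 down to k.
        print(k, pi[k] * lam, pi[k + 1] * mu)
```

By the PASTA principle discussed earlier, the last entry $\pi_K$ is also the fraction of arriving customers who find the queue full and leave.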


8.6.3. The Number of Customers in an M/M/∞ Queue


Suppose new users join a peer-to-peer network according to a Poisson process with
rate $\lambda$. The length of time a user stays connected to the network has exponential
distribution with parameter $\mu$. Assume that, at time 0, no users were connected to the
network. Let $M(t)$ be the number of connected users at time $t$. What is the distribution
of $M(t)$?

We can view this process as a Markovian queue with an unlimited number of servers.
A customer starts being served the moment she joins the system and leaves when she is
done. We demonstrate two ways of analyzing this process. We first use the rate
equations (8.4) to compute the stationary distribution for the process. The second approach
is more complex, but it yields more information: we explicitly compute the distribution
of the number of customers in the system at time $t$ and then consider the limit as $t$
goes to infinity.

To write the rate equations of the process, we observe that if (at a given time) there
are $k \ge 0$ customers in the system, then the next event can be either termination of
service of one of the $k$ current customers or the arrival of a new customer. Thus, the time
to the first event is the minimum of $k + 1$ independent exponentially distributed random
variables; $k$ of these variables have parameter $\mu$ and one has parameter $\lambda$. Applying
Lemma 8.5 shows that, when there are $k$ customers in the system, the time to first event
has an exponential distribution with parameter $\theta_k = k\mu + \lambda$. Furthermore, the lemma
implies that, given that an event occurs, the probability that the event is an arrival of a
new customer is

$$p_{k,k+1} = \frac{\lambda}{\lambda + k\mu},$$

and when $k \ge 1$ the probability that the event is the departure of a customer is

$$p_{k,k-1} = \frac{k\mu}{\lambda + k\mu}.$$

Plugging these values into (8.4), we have that the stationary distribution $\bar{\pi}$ satisfies

$$\pi_0\lambda = \pi_1\mu$$

and, for $k \ge 1$,

$$\pi_k(\lambda + k\mu) = \pi_{k-1}\lambda + \pi_{k+1}(k + 1)\mu. \tag{8.8}$$

We rewrite (8.8) as

$$\pi_{k+1}(k + 1)\mu = \pi_k(\lambda + k\mu) - \pi_{k-1}\lambda = \pi_k\lambda + \pi_k k\mu - \pi_{k-1}\lambda.$$

A simple induction yields that

$$\pi_k k\mu = \pi_{k-1}\lambda,$$

and therefore

$$\pi_{k+1} = \frac{\lambda}{\mu(k + 1)}\pi_k.$$

Now, again a simple induction yields

$$\pi_k = \pi_0\frac{(\lambda/\mu)^k}{k!},$$

and therefore

$$1 = \sum_{k=0}^{\infty}\pi_k = \pi_0\sum_{k=0}^{\infty}\frac{(\lambda/\mu)^k}{k!} = \pi_0 e^{\lambda/\mu}.$$

We conclude that $\pi_0 = e^{-\lambda/\mu}$ and, more generally,

$$\pi_k = \frac{e^{-\lambda/\mu}(\lambda/\mu)^k}{k!},$$

so that the equilibrium distribution is the discrete Poisson distribution with parameter
$\lambda/\mu$.
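As a check on this derivation, the sketch below plugs the Poisson distribution with parameter $\lambda/\mu$ back into the rate equations for the M/M/∞ queue, namely $\pi_0\lambda = \pi_1\mu$ and $\pi_k(\lambda + k\mu) = \pi_{k-1}\lambda + \pi_{k+1}(k+1)\mu$ for $k \ge 1$, and reports the residuals, which should vanish up to rounding. The function names and the rates used are illustrative assumptions.

```python
import math


def mm_inf_pi(k: int, lam: float, mu: float) -> float:
    """Poisson(lam/mu) probability of k customers in an M/M/infinity queue."""
    ratio = lam / mu
    return math.exp(-ratio) * ratio ** k / math.factorial(k)


def balance_residual(k: int, lam: float, mu: float) -> float:
    """Left side minus right side of the rate equation for state k."""
    if k == 0:
        return mm_inf_pi(0, lam, mu) * lam - mm_inf_pi(1, lam, mu) * mu
    rate_out = mm_inf_pi(k, lam, mu) * (lam + k * mu)
    rate_in = (mm_inf_pi(k - 1, lam, mu) * lam
               + mm_inf_pi(k + 1, lam, mu) * (k + 1) * mu)
    return rate_out - rate_in


if __name__ == "__main__":
    lam, mu = 4.0, 1.5  # illustrative rates only
    print("largest residual:",
          max(abs(balance_residual(k, lam, mu)) for k in range(0, 25)))
```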

We now proceed with our second approach: computing the distribution of the number
of customers in the system at time $t$, denoted by $M(t)$, and then considering the
limit of $M(t)$ as $t$ goes to infinity. Let $N(t)$ be the total number of users that have
joined the network in the interval $[0, t]$. Since $N(t)$ has a Poisson distribution, we can
condition on this value and write

$$\Pr(M(t) = j) = \sum_{n=0}^{\infty}\Pr(M(t) = j \mid N(t) = n)\,e^{-\lambda t}\frac{(\lambda t)^n}{n!}. \tag{8.9}$$

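If a user joins the network at time $x$, then the probability that she is still connected
at time $t$ is $e^{-\mu(t-x)}$. From Section 8.4.3, we know that the arrival time of an arbitrary
user is uniform on $[0, t]$. Thus, the probability that an arbitrary user is still connected
at time $t$ is given by

$$p = \int_0^t e^{-\mu(t-x)}\frac{1}{t}\,dx = \frac{1}{\mu t}\left(1 - e^{-\mu t}\right).$$

Because the events for different users are independent, for $j \le n$ we have

$$\Pr(M(t) = j \mid N(t) = n) = \binom{n}{j}p^j(1 - p)^{n-j}.$$

Plugging this value into Eqn. (8.9), we find that

$$\begin{aligned}
\Pr(M(t) = j) &= \sum_{n=j}^{\infty}\binom{n}{j}p^j(1 - p)^{n-j}\,e^{-\lambda t}\frac{(\lambda t)^n}{n!} \\
&= e^{-\lambda t}\frac{(\lambda tp)^j}{j!}\sum_{n=j}^{\infty}\frac{(\lambda t(1 - p))^{n-j}}{(n - j)!} \\
&= e^{-\lambda t}\frac{(\lambda tp)^j}{j!}\,e^{\lambda t(1-p)} \\
&= e^{-\lambda tp}\frac{(\lambda tp)^j}{j!}.
\end{aligned}$$

Thus, the number of users at time $t$ has a Poisson distribution with parameter $\lambda tp$.
Since

$$\lim_{t \to \infty}\lambda tp = \lim_{t \to \infty}\lambda t\,\frac{1}{\mu t}\left(1 - e^{-\mu t}\right) = \frac{\lambda}{\mu},$$

it follows that, in the limit, the number of customers has a Poisson distribution with
parameter $\lambda/\mu$, matching our previous calculation.


8.7. Exercises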


Exercise 8.1: Let $X$ and $Y$ be independent, uniform random variables on $[0, 1]$. Find
the density function and distribution function for $X + Y$.

Exercise 8.2: Let $X$ and $Y$ be independent, exponentially distributed random variables
with parameter 1. Find the density function and distribution function for $X + Y$.

Exercise 8.3: Let $X$ be a uniform random variable on $[0, 1]$. Determine $\Pr(X \le 1/2 \mid
1/4 \le X \le 3/4)$ and $\Pr(X \le 1/4 \mid (X \le 1/3) \cup (X \ge 2/3))$.

Exercise 8.4: We agree to try to meet between 12 and 1 for lunch at our favorite sandwich
shop. Because of our busy schedules, neither of us is sure when we'll arrive; we
assume that, for each of us, our arrival time is uniformly distributed over the hour. So
that neither of us has to wait too long, we agree that we will each wait exactly 15 minutes
for the other to arrive, and then leave. What is the probability we actually meet
each other for lunch?

Exercise 8.5: In Lemma 8.3, we found the expectation of the smallest of $n$ independent
uniform random variables over $[0, 1]$ by directly computing the probability that it
was larger than $y$ for $0 \le y \le 1$. Perform a similar calculation to find the probability
that the $k$th smallest of the $n$ random variables is larger than $y$, and use this to show
that its expected value is $k/(n + 1)$.


Exercise 8.6: Let $X_1, X_2, \ldots, X_n$ be independent exponential random variables with
parameter 1. Find the expected value of the $k$th largest of the $n$ random variables.


Exercise 8.7: Consider a complete graph on $n$ vertices. Each edge is assigned a weight
chosen independently and uniformly at random from the real interval $[0, 1]$. Show that
the expected weight of the minimum spanning tree of this graph is at least $1 - 1/\bigl(1 + \binom{n}{2}\bigr)$.
Find a similar bound when each edge is independently assigned a weight from an
exponential distribution with parameter 1.

Exercise 8.8: Consider a complete graph on n vertices. Each edge is aligned a w eight 
chosen independently and uniformly at random from the real interval (0.1 ]. We pro- 
pose the following greedy method for finding a small-weight Hamiltonian cycle in the 
graph. At each step, there is a head vertex. Initially the head is vertex 1. At each step, 
ue find the edge of least weight between the current head vertex and a new vertex that 
has never been the head. We add this edge to the cycle and set the head vertex to the 
new vertex. After n — 1 steps, we have a Hamiltonian path, which w e complete to make 
a Hamiltonian cycle by adding the edge from the last head vertex back to vertex 1. 
What is the expected weight of the Hamiltonian cycle found by this greedy approach? 
Also, find the expectation when each edge is independently assigned a weight from an 
exponential distribution with parameter 1. 
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Exercise 8.9: You would like to write a simulation that uses exponentially distributed 
random variables. Your system has a random number generator that produces indepen¬ 
dent, uniformly distributed numbers from the real interval (0,1). Give a procedure that 
transforms a uniform random number as given to an exponentially distributed random 
variable with parameter A.. 

Exercise 8.10: Let n points be placed uniformly at random on the boundary of a circle
of circumference 1. These n points divide the circle into n arcs. Let $Z_i$ for $1 \le i \le n$
be the length of these arcs in some arbitrary order.

(a) Prove that all $Z_i$ are at most $c \ln n/(n - 1)$ with probability at least $1 - 1/n^{c-1}$.

(b) Prove that, for sufficiently large n, there exists a constant $c'$ such that at least one
$Z_i$ is at least $c' \ln n / n$ with probability at least 1/2. (Hint: Use the second moment
method.)

(c) Prove that all $Z_i$ are at least $1/2n^2$ with probability at least 1/2.

(d) Prove that, for sufficiently large n, there exists a constant $c'$ such that at least one
$Z_i$ is at most $c'/n^2$ with probability at least 1/2. (Hint: Use the second moment
method.)

(e) Explain how these results relate to the following problem: $X_1, X_2, \ldots, X_{n-1}$ are
values chosen independently and uniformly at random from the real interval [0,1].
We let $Y_1, Y_2, \ldots, Y_{n-1}$ represent these values in increasing sorted order, and we
also define $Y_0 = 0$ and $Y_n = 1$. The points $Y_i$ break the unit interval into n segments.
What can we say about the shortest and longest of these segments?

Exercise 8.11: Bucket sort is a simple sorting algorithm discussed in Section 5.2.2. 

(a) Explain how to implement Bucket sort so that its expected running time is O(n)
when the n elements to be sorted are independent, uniform random numbers that
are chosen from [0,1].

(b) We now consider how to implement Bucket sort when the elements to be sorted
are not necessarily uniform over an interval. Specifically, suppose the elements to
be sorted are numbers of the form X + Y, where (for each element) X and Y are
independent, uniform random numbers chosen from [0,1]. How can you modify
the buckets so that Bucket sort still has expected running time O(n)? What if the
elements to be sorted were numbers of the form max(X, Y) instead of X + Y?

Exercise 8.12: Let n points be placed uniformly at random on the boundary of a circle
of circumference 1. These n points divide the circle into n arcs. Let $Z_i$ for $1 \le i \le n$
be the length of these arcs in some arbitrary order, and let X be the number of $Z_i$ that
are at least 1/n. Find E[X] and Var[X].

Exercise 8.13: A digital camera needs two batteries. You buy a pack of n batteries,
labeled 1 to n. Initially, you install batteries 1 and 2. Whenever a battery is drained,
you immediately replace the drained battery with the lowest numbered unused battery.
Assume that each battery lasts for an amount of time that is exponentially distributed
with mean $\mu$ before being drained, independent of all other batteries. Eventually, all
the batteries but one will be drained.

(a) Find the probability that the battery numbered i is the one that is not eventually 
drained. 

(b) Find the expected time your camera will be able to run with this pack of batteries. 


Exercise 8.14: Let $X_1, X_2, \ldots$ be exponential random variables with parameter 1.

(a) Argue that $X_1 + X_2$ is not an exponential random variable.

(b) Let N be a geometric random variable with parameter p. Prove that $\sum_{i=1}^{N} X_i$ is
exponentially distributed with parameter p.

Exercise 8.15: (a) Let $X_1, X_2, \ldots$ be a sequence of independent exponential random
variables, each with mean 1. Given a positive real number k, let N be defined by

$$N = \min\left\{ n : \sum_{i=1}^{n} X_i > k \right\}.$$

That is, N is the smallest number for which the sum of the first N of the $X_i$ is larger
than k. Determine E[N].

(b) Let $X_1, X_2, \ldots$ be a sequence of independent uniform random variables on the
interval (0,1). Given a positive real number k with 0 < k < 1, let N be defined by

$$N = \min\left\{ n : \prod_{i=1}^{n} X_i < k \right\}.$$

That is, N is the smallest number for which the product of the first N of the $X_i$ is
smaller than k. Determine E[N]. (Hint: You may find Exercise 8.9 helpful.)


Exercise 8.16: There are n tasks that are given to n processors. Each task has two
phases, and the time for each phase is given by an exponentially distributed random
variable with parameter 1. The times for all phases and for all tasks are independent.
We say that a task is half-done if it has finished one of its two phases.

(a) Derive an expression for the probability that there are k tasks that are half-done at
the instant when exactly one task becomes completely done.

(b) Derive an expression for the expected time until exactly one task becomes
completely done.

(c) Explain how this problem is related to the birthday paradox.


Exercise 8.17: Prove Theorem 8.11. 


Exercise 8.18: Complete the proof of Theorem 8.13 by showing formally that, if $N_1(t)$
and $N_2(t)$ are independent, then so are $N_1(t)$ and $N_2(u)$ for any t, u > 0.

Exercise 8.19: You are waiting at a bus stop to catch a bus across town. There are
actually n different bus lines you can take, each following a different route. Which bus
you decide to take will depend on which bus gets to the bus stop first. As long as you
are waiting, the time you have to wait for a bus on the ith line is exponentially
distributed with mean $\mu_i$ minutes. Once you get on a bus on the ith line, it will take you $t_i$
minutes to get across town.

Design an algorithm for deciding - when a bus arrives - whether or not you should
get on the bus, assuming your goal is to minimize the expected time to cross town.
(Hint: You want to determine the set of buses that you want to take as soon as they
arrive. There are $2^n$ possible sets, which is far too large for an efficient algorithm. Argue
that you need only consider a small number of these sets.)


Exercise 8.20: Given a discrete space, continuous time Markov process X(t), we can
derive a discrete time Markov chain Z(t) by considering the states the process visits.
That is, let Z(0) = X(0), let Z(1) be the state that process X(t) first moves to after
time t = 0, let Z(2) be the next state process X(t) moves to, and so on. (If the Markov
process X(t) makes a transition from state i to state i, which can occur when $p_{i,i} \ne 0$
in the associated transition matrix, then the Markov chain Z(t) should also make a
transition from state i to state i.)

(a) Suppose that, in the process X(t), the time spent in state i is exponentially
distributed with parameter $\theta_i = \theta$ (which is the same for all i). Further suppose that the
process X(t) has a stationary distribution. Show that the Markov chain Z(t) has
the same stationary distribution.

(b) Give an example showing that, if the $\theta_i$ are not all equal, then the stationary
distributions for X(t) and Z(t) may differ.

Exercise 8.21: The Ehrenfest model is a basic model used in physics. There are n par¬ 
ticles moving randomly in a container. We consider the number of particles in the left 
and right halves of the container. A particle in one half of the container moves to the 
other half after an amount of time that is exponentially distributed with parameter 1, 
independently of all other particles. See Figure 8.6. 

(a) Find the stationary distribution of this process. 

(b) What state has the highest probability in the stationary distribution? Can you sug¬ 
gest an explanation for this? 

Figure 8.6: The Ehrenfest model.

Exercise 8.22: We can obtain a discrete time Markov chain from the M/M/1 queueing
process in the manner described in Exercise 8.20. The discrete time chain tracks
the number of customers in the queue. It is useful to allow departure events to occur
with rate $\lambda$ at the queue even when it is empty; this does not affect the queue behavior,
but it gives transitions from state 0 to state 0 in the corresponding Markov chain.

(a) Describe the possible transitions of this discrete-time chain and give their
probabilities.

(b) Show that the stationary distribution of this chain when $\lambda < \mu$ is the same as for
the M/M/1 process.

(c) Show that, in the case $\lambda = \mu$, there is no valid stationary distribution for the
Markov chain.

Exercise 8.23: In a tandem queue, customers arrive to an M/M/1 queue according
to a Poisson process of rate $\lambda$ with service times independent and exponentially
distributed with parameter $\mu_1$. After completing service at this first queue, the customers
proceed immediately to a second queue, also being served by a single server, where
service times are independent and exponentially distributed with parameter $\mu_2$. Find
the stationary distribution of this system. (Hint: Try to generalize the form of the
stationary distribution for a single queue.)

Exercise 8.24: Write a program to simulate the model of balls and bins with feedback.

(a) Start your simulation with 51 balls in bin 1 and 49 balls in bin 2, using p = 2. Run
your program 100 times, having it stop each time one bin has 60% of the balls. On
average, how many balls are in the bins when the program stops? How often does
bin 1 have the majority?

(b) Perform the same experiment as in part (a) but start with 52 balls in bin 1 and 48
balls in bin 2. How much does this change your answers?

(c) Perform the same experiment as in part (a) but start with 102 balls in bin 1 and 98
balls in bin 2. How much does this change your answers?

(d) Perform the same experiment as in part (a), but now use p = 1.5. How much does
this change your answers?

Exercise 8.25: We consider here one approach for studying a FIFO queue with a constant
service time of duration 1 and Poisson arrivals with parameter $\lambda < 1$. We replace
the constant service time by k exponentially distributed service stages, each of mean
duration 1/k. A customer must pass through all k stages before leaving the queue, and
once one customer begins going through the k stages, no other customer can receive
service until that customer finishes.

(a) Derive Chernoff bounds for the probability that the total time taken in k exponentially
distributed stages, each of mean 1/k, deviates significantly from 1.

(b) Derive a set of equations that define the stationary distribution for this situation.
(Hint: Try letting $\pi_j$ be the limiting probability of having j stages of service left
to be served in the queue. Each waiting customer requires k stages; the one being
served requires between 1 and k stages.) You should not try to solve these equations
to give a closed form for $\pi_j$.

(c) Use these equations to numerically determine the average number of customers
in the queue in equilibrium, say for $\lambda = 0.8$ and for k = 10, 20, 30, 40, and 50.
Discuss whether your results seem to be converging as k increases, and compare
the expected number of customers to an M/M/1 queue with arrival rate $\lambda < 1$ and
expected service time $\mu = 1$.


Exercise 8.26: Write a simulation for a bank of n M/M/1 FIFO queues, each with
Poisson arrivals of rate $\lambda < 1$ per second and each with service times exponentially
distributed with mean 1 second. Your simulation should run for t seconds and return the
average amount of time spent in the system per customer who completed service. You
should present results for your simulations for n = 100 and for t = 10,000 seconds
with $\lambda$ = 0.5, 0.8, 0.9, and 0.99.

A natural way to write the simulation that we now describe is to keep a priority 
queue of events. Such a queue stores the times of all pending events, such as the next 
time a customer will arrive or the next time a customer will finish service at a queue. 
A priority queue can answer queries of the form, “What is the next event?” Priority 
queues are often implemented as heaps, for example. 
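
The priority queue of events described above can be sketched with Python's standard `heapq` module. This is only an illustration of the data structure, not the full simulation: the event tuple layout `(time, kind, queue_index)` and the sample event times are assumptions made for the example.

```python
import heapq

# Each event is a (time, kind, queue_index) tuple. heapq orders tuples by
# their first field, so the earliest pending event is always popped first.
events = []
heapq.heappush(events, (2.5, "arrival", 0))
heapq.heappush(events, (0.7, "arrival", 1))
heapq.heappush(events, (1.9, "departure", 1))

# Answer the query "what is the next event?" repeatedly until none remain.
processed = []
while events:
    time, kind, queue_index = heapq.heappop(events)
    processed.append((time, kind, queue_index))
```

In a simulation, handling one popped event would typically push the follow-up events (the next arrival, or the next service completion) back onto the same heap.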

When a customer bound for queue k arrives, the arrival time for the next customer to 
queue k must then be calculated and entered in the priority queue. If queue k is empty, 
the time that the arriving customer will complete service should be put in the priority 
queue. If queue k is not empty, the customer is put at the tail of the queue. If a queue 
is not empty after completing service for a customer, then the time that the next cus¬ 
tomer (at the head of the queue) will complete service should be calculated and put in 
the priority queue. You will have to track each customer's arrival time and completion 
time. 

You may find ways to simplify this general scheme. For example, instead of con¬ 
sidering a separate arrival process for each queue, you can combine them into a single 
arrival process based on what we know from Section 8.4.2. Explain whatever simpli¬ 
fications you use. 

You may wish to use Exercise 8.9 to help construct exponentially distributed random 
variables for your simulation. 

Modify your simulation so that, instead of service times being exponentially
distributed with mean 1 second, they are always exactly 1 second. Again present results
for your simulation for n = 100 and for t = 10,000 seconds with $\lambda$ = 0.5, 0.8, 0.9,
and 0.99. Do customers complete more quickly with exponentially distributed service
times or constant service times?
CHAPTER NINE


Entropy, Randomness, 
and Information 


Suppose that we have two biased coins. One comes up heads with probability 3/4, and
the other comes up heads with probability 7/8. Which coin produces more randomness
per flip? In this chapter, we introduce the entropy function as a universal measure
of randomness. In particular, we show that the number of independent unbiased random
bits that can be extracted from a sequence of biased coin flips corresponds to the
entropy of the coin. Entropy also plays a fundamental role in information and
communication. To demonstrate this role, we examine some basic results in compression and
coding and see how they relate to entropy. The main result we prove is Shannon's coding
theorem for the binary symmetric channel, one of the fundamental results of the
field of information theory. Our proof of Shannon's theorem uses several ideas that we
have developed in previous chapters, including Chernoff bounds, Markov's inequality,
and the probabilistic method.


9.1. The Entropy Function 

The entropy of a random variable is a function of its distribution that, as we shall see,
gives a measure of the randomness of the distribution.

Definition 9.1:

1. The entropy in bits of a discrete random variable X is given by

$$H(X) = -\sum_{x} \Pr(X = x) \log_2 \Pr(X = x),$$

where the summation is over all values x in the range of X. Equivalently, we may
write

$$H(X) = \mathbf{E}\left[\log_2 \frac{1}{\Pr(X)}\right].$$

2. The binary entropy function H(p) for a random variable that assumes only two
possible outcomes, one of which occurs with probability p, is

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p).$$

We define H(0) = H(1) = 0, so the binary entropy function is continuous in the interval
[0,1]. The function is drawn in Figure 9.1.

Figure 9.1: The binary entropy function.

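For our two biased coins, the entropy of the coin that comes up heads with probability
3/4 is

$$H\left(\frac{3}{4}\right) = -\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} = 2 - \frac{3}{4}\log_2 3 \approx 0.8113,$$

while the entropy of the coin that comes up heads with probability 7/8 is

$$H\left(\frac{7}{8}\right) = -\frac{7}{8}\log_2\frac{7}{8} - \frac{1}{8}\log_2\frac{1}{8} = 3 - \frac{7}{8}\log_2 7 \approx 0.5436.$$

Hence the coin that comes up heads with probability 3/4 has a larger entropy.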
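
The two values above can be checked numerically. The following is a minimal sketch; the function name `binary_entropy` is chosen here for illustration and is not from the text.

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy of the two biased coins from the text.
h_three_quarters = binary_entropy(3 / 4)   # equals 2 - (3/4) * log2(3)
h_seven_eighths = binary_entropy(7 / 8)    # equals 3 - (7/8) * log2(7)
```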

Taking the derivative of H(p),

$$\frac{dH(p)}{dp} = -\log_2 p + \log_2 (1 - p) = \log_2 \frac{1 - p}{p},$$

we see that H(p) is maximized when p = 1/2 and that H(1/2) = 1 bit. One way of
interpreting this statement is to say: each time we flip a two-sided coin, we get out at
most 1 bit worth of randomness, and we obtain exactly 1 bit of randomness when the
coin is fair. Although this seems quite clear, it is not yet clear in what sense
$H(3/4) = 2 - \frac{3}{4}\log_2 3$ means that we obtain H(3/4) random bits each time we flip a coin that
lands heads with probability 3/4. We clarify this later in the chapter.

As another example, a standard six-sided die that comes up on each
side with probability 1/6 has entropy $\log_2 6$. In general, a random variable that has n
equally likely outcomes has entropy

$$-\sum_{i=1}^{n} \frac{1}{n} \log_2 \frac{1}{n} = \log_2 n.$$

The entropy of an eight-sided die is therefore 3 bits. This result should seem quite
natural; if the faces of the die were numbered from 0 to 7 written in binary, then the
outcome of the die roll would give a sequence of 3 bits uniform over the set $\{0,1\}^3$,
which is equivalent to 3 bits generated independently and uniformly at random.
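
As a quick numerical illustration of the uniform case, the sketch below evaluates the entropy sum directly. The helper `entropy` and its list-of-probabilities input format are assumptions made for this example.

```python
import math

def entropy(probabilities):
    """Entropy in bits of a distribution given as a list of probabilities."""
    # Outcomes with probability 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

six_sided = entropy([1 / 6] * 6)     # should agree with log2(6)
eight_sided = entropy([1 / 8] * 8)   # should agree with 3 bits
```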

It is worth emphasizing that the entropy of a random variable X depends not on the
values that X can take but only on the probability distribution of X over those values.
The entropy of an eight-sided die does not depend on what numbers are on the faces
of the die; it only matters that all eight sides are equally likely to come up. This property
does not hold for the expectation or variance of X, but it does make sense for a
measure of randomness. To measure the randomness in a die, we should not care about
what numbers are on the faces but only about how often the die comes up on each side.

Often in this chapter we consider the entropy of a sequence of independent random
variables, such as the entropy of a sequence of independent coin flips. For such situations,
the following lemma allows us to consider the entropy of each random variable
to find the entropy of the sequence.


Lemma 9.1: Let $X_1$ and $X_2$ be independent random variables, and let $Y = (X_1, X_2)$.
Then

$$H(Y) = H(X_1) + H(X_2).$$

Of course, the lemma is trivially extended by induction to the case where Y is any finite
sequence of independent random variables.

Proof: In what follows, the summations are to be taken over all possible values that
can be assumed by $X_1$ and $X_2$. The result follows by using the independence of $X_1$ and
$X_2$ to simplify the expression:

$$
\begin{aligned}
H(Y) &= -\sum_{x_1, x_2} \Pr((X_1, X_2) = (x_1, x_2)) \log_2 \Pr((X_1, X_2) = (x_1, x_2)) \\
&= -\sum_{x_1, x_2} \Pr(X_1 = x_1) \Pr(X_2 = x_2) \log_2 \bigl(\Pr(X_1 = x_1) \Pr(X_2 = x_2)\bigr) \\
&= -\sum_{x_1, x_2} \Pr(X_1 = x_1) \Pr(X_2 = x_2) \bigl(\log_2 \Pr(X_1 = x_1) + \log_2 \Pr(X_2 = x_2)\bigr) \\
&= -\sum_{x_1} \sum_{x_2} \Pr(X_2 = x_2) \Pr(X_1 = x_1) \log_2 \Pr(X_1 = x_1) \\
&\qquad - \sum_{x_2} \sum_{x_1} \Pr(X_1 = x_1) \Pr(X_2 = x_2) \log_2 \Pr(X_2 = x_2) \\
&= -\Bigl(\sum_{x_1} \Pr(X_1 = x_1) \log_2 \Pr(X_1 = x_1)\Bigr) \Bigl(\sum_{x_2} \Pr(X_2 = x_2)\Bigr) \\
&\qquad - \Bigl(\sum_{x_2} \Pr(X_2 = x_2) \log_2 \Pr(X_2 = x_2)\Bigr) \Bigl(\sum_{x_1} \Pr(X_1 = x_1)\Bigr) \\
&= -\sum_{x_1} \Pr(X_1 = x_1) \log_2 \Pr(X_1 = x_1) - \sum_{x_2} \Pr(X_2 = x_2) \log_2 \Pr(X_2 = x_2) \\
&= H(X_1) + H(X_2). \qquad \blacksquare
\end{aligned}
$$
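
The additivity stated in Lemma 9.1 can be checked numerically on a small example. The sketch below uses two made-up independent distributions (a biased coin and a loaded three-sided die); both the distributions and the helper `entropy` are assumptions chosen for illustration.

```python
import math
from itertools import product

def entropy(dist):
    """Entropy in bits of a distribution given as {value: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Two hypothetical independent random variables.
x1 = {"H": 0.75, "T": 0.25}
x2 = {1: 0.5, 2: 0.3, 3: 0.2}

# Independence means each joint probability is the product of the marginals.
joint = {(a, b): pa * pb
         for (a, pa), (b, pb) in product(x1.items(), x2.items())}

lhs = entropy(joint)                 # H(Y) for Y = (X1, X2)
rhs = entropy(x1) + entropy(x2)      # H(X1) + H(X2)
```

The two quantities agree up to floating-point rounding, as the lemma predicts.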


9.2. Entropy and Binomial Coefficients 

As a prelude to showing the various applications of entropy, we first demonstrate how 
it naturally arises in a purely combinatorial context. 

Lemma 9.2: Suppose that nq is an integer in the range [0, n]. Then

$$\frac{2^{nH(q)}}{n + 1} \le \binom{n}{nq} \le 2^{nH(q)}.$$


Proof: The statement is trivial if q = 0 or q = 1, so assume that 0 < q < 1. To prove
the upper bound, notice that by the binomial theorem we have

$$\binom{n}{nq} q^{qn} (1 - q)^{(1-q)n} \le \sum_{k=0}^{n} \binom{n}{k} q^{k} (1 - q)^{n-k} = (q + (1 - q))^{n} = 1.$$

Hence,

$$\binom{n}{nq} \le q^{-qn} (1 - q)^{-(1-q)n} = 2^{-qn \log_2 q}\, 2^{-(1-q)n \log_2 (1-q)} = 2^{nH(q)}.$$

For the lower bound, we know that $\binom{n}{nq} q^{qn} (1 - q)^{(1-q)n}$ is one term of the expression
$\sum_{k=0}^{n} \binom{n}{k} q^{k} (1 - q)^{n-k}$. We show that it is the largest term. Consider the difference
between two consecutive terms as follows:

$$\binom{n}{k} q^{k} (1 - q)^{n-k} - \binom{n}{k+1} q^{k+1} (1 - q)^{n-k-1} = \binom{n}{k} q^{k} (1 - q)^{n-k} \left(1 - \frac{n - k}{k + 1} \cdot \frac{q}{1 - q}\right).$$

This difference is nonnegative whenever

$$1 - \frac{n - k}{k + 1} \cdot \frac{q}{1 - q} \ge 0$$

or (equivalently, after some algebra) whenever

$$k \ge qn - 1 + q.$$

The terms are therefore increasing up to k = qn and decreasing after that point. Thus
k = qn gives the largest term in the summation.

Since the summation has n + 1 terms and since $\binom{n}{nq} q^{qn} (1 - q)^{(1-q)n}$ is the largest
term, we have

$$\binom{n}{nq} q^{qn} (1 - q)^{(1-q)n} \ge \frac{1}{n + 1}$$

or

$$\binom{n}{nq} \ge \frac{q^{-qn} (1 - q)^{-(1-q)n}}{n + 1} = \frac{2^{nH(q)}}{n + 1}. \qquad \blacksquare$$
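
The bounds of Lemma 9.2 are easy to test numerically. The sketch below checks them in logarithmic form for every n below 60 and every k with q = k/n; the function names, the range of n, and the floating-point tolerance are choices made for this illustration.

```python
import math

def binary_entropy(q: float) -> float:
    """Binary entropy H(q) in bits, with H(0) = H(1) = 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def lemma_bounds_hold(n: int, k: int) -> bool:
    """Check 2^{nH(q)}/(n+1) <= C(n, k) <= 2^{nH(q)} for q = k/n (log2 form)."""
    q = k / n
    log_binom = math.log2(math.comb(n, k))
    upper = n * binary_entropy(q)
    lower = upper - math.log2(n + 1)
    tol = 1e-9  # slack for floating-point rounding only
    return lower - tol <= log_binom <= upper + tol

all_hold = all(lemma_bounds_hold(n, k)
               for n in range(1, 60) for k in range(n + 1))
```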

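We often use the following slightly more specific corollary.

Corollary 9.3: When $0 \le q \le 1/2$,

$$\binom{n}{\lfloor nq \rfloor} \le 2^{nH(q)}; \qquad (9.1)$$

similarly, when $1/2 \le q \le 1$,

$$\binom{n}{\lceil nq \rceil} \le 2^{nH(q)}. \qquad (9.2)$$

For $1/2 \le q \le 1$,

$$\frac{2^{nH(q)}}{n + 1} \le \binom{n}{\lfloor nq \rfloor}; \qquad (9.3)$$

similarly, when $0 \le q \le 1/2$,

$$\frac{2^{nH(q)}}{n + 1} \le \binom{n}{\lceil nq \rceil}. \qquad (9.4)$$

Proof: We first prove Eqn. (9.1); the proof of (9.2) is entirely similar. When $0 \le q \le 1/2$,

$$\binom{n}{\lfloor nq \rfloor} q^{qn} (1 - q)^{(1-q)n} \le \binom{n}{\lfloor nq \rfloor} q^{\lfloor nq \rfloor} (1 - q)^{n - \lfloor nq \rfloor} \le \sum_{k=0}^{n} \binom{n}{k} q^{k} (1 - q)^{n-k} = 1,$$

from which we can proceed exactly as in Lemma 9.2.

Equation (9.3) holds because, when $q \ge 1/2$, Lemma 9.2 gives

$$\binom{n}{\lfloor nq \rfloor} \ge \frac{2^{nH(\lfloor nq \rfloor / n)}}{n + 1} \ge \frac{2^{nH(q)}}{n + 1}.$$

Equation (9.4) is derived similarly. ■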


Although these bounds are loose, they are sufficient for our purposes. The relation
between the combinatorial coefficients and the entropy function arises repeatedly in the
proofs of this chapter when we consider a sequence of biased coin tosses, where the
coin lands heads with probability p > 1/2. Applying the Chernoff bound, we know
that, for sufficiently large n, the number of heads will almost always be close to np.
Thus the sequence will almost always be one of roughly $\binom{n}{np} \approx 2^{nH(p)}$ sequences,
where the approximation follows from Lemma 9.2. Moreover, each such sequence
occurs with probability roughly

$$p^{np} (1 - p)^{n(1-p)} = 2^{-nH(p)}.$$

Hence, when we consider the outcome of n flips with a biased coin, we can essentially
restrict ourselves to the roughly $2^{nH(p)}$ outcomes that occur with roughly equal
probability.

9.3. Entropy: A Measure of Randomness 

One way of interpreting the entropy of a random variable is as a measure of how many 
unbiased, independent bits can be extracted, on average, from one instantiation of the 
random variable. We consider this question in the context of a biased coin, showing 
that, for sufficiently large n, the expected number of bits that can be extracted from n 
flips of a coin that comes up heads with probability p > 1/2 is essentially nH(p). In 
other words, on average, one can generate approximately H(p) independent bits from 
each flip of a coin with entropy H(p). This result can be generalized to other random 
variables, but we focus on the specific case of biased coins here (and throughout this 
chapter) to keep the arguments more transparent. 

We begin with a definition that clarifies what we mean by extracting random bits. 

Definition 9.2: Let |y| be the number of bits in a sequence of bits y. An extraction
function Ext takes as input the value of a random variable X and outputs a sequence
of bits y such that

$$\Pr(\mathrm{Ext}(X) = y \mid |y| = k) = 1/2^{k}$$

whenever $\Pr(|y| = k) > 0$.

In the case of a biased coin, the input X is the outcome of n flips of our biased coin. 
The number of bits in the output is not fixed but instead can depend on the input. If 
the extraction function outputs k bits, we can think of these bits as having been gener¬ 
ated independently and uniformly at random, since each sequence of k bits is equally 
likely to appear. Also, there is nothing in the definition that requires that the extraction 
function be efficient to compute. We do not concern ourselves with efficiency here, 
although we do consider an efficient extraction algorithm in Exercise 9.12. 

As a first step toward proving our results about extracting unbiased bits from biased 
coins, we consider the problem of extracting random bits from a uniformly distributed 
integer random variable. For example, let X be an integer chosen uniformly at random 
from {0, ..., 7}, and let Y be the sequence of 3 bits obtained when we write X as a
binary number. If X = 0 then Y = 000, and if X = 7 then Y = 111. It is easy to check
that every sequence of 3 bits is equally likely to arise, so we have a trivial extraction 
function Ext by associating any input X with the corresponding output Y. 

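Things are slightly harder when X is uniform over {0, ..., 11}. If $X \le 7$, then we can
again let Y be the sequence of 3 bits obtained when we write X in binary. This leaves
the case $X \in \{8, 9, 10, 11\}$. We can associate each of these four possibilities with a
distinct sequence of 2 bits, for example, by letting Y be the sequence of 2 bits obtained
by writing X - 8 as a binary number. Thus, if X = 8 then Y = 00, and if X = 11
then Y = 11. The entire extraction function is shown in Figure 9.2. Every 3-bit
sequence arises with the same probability 1/12, and every 2-bit sequence arises with the
same probability 1/12, so Definition 9.2 is satisfied.

Input   0    1    2    3    4    5    6    7
Output  000  001  010  011  100  101  110  111

Input   0    1    2    3    4    5    6    7    8    9    10   11
Output  000  001  010  011  100  101  110  111  00   01   10   11

Figure 9.2: Extraction functions for numbers that are chosen uniformly at random
from {0, ..., 7} and {0, ..., 11}.

We generalize from these examples to the following theorem.

Theorem 9.4: Suppose that the value of a random variable X is chosen uniformly at
random from the integers {0, ..., m - 1}, so that $H(X) = \log_2 m$. Then there is an
extraction function for X that outputs on average at least $\lfloor \log_2 m \rfloor - 1$
independent and unbiased bits.

Proof: If m > 1 is a power of 2, then the extraction function can simply output the bit
representation of the input X using $\log_2 m$ bits. All output sequences of $\log_2 m$ bits
are equally likely to appear, so this satisfies Definition 9.2.

If m is not a power of 2, then matters become more complicated. We describe the
extraction function recursively. (A nonrecursive description is given in the exercises.)
Let $\alpha = \lfloor \log_2 m \rfloor$. If $X \le 2^{\alpha} - 1$, then the function outputs the $\alpha$-bit binary
representation of X; all sequences of $\alpha$ bits are equally likely to be output in this case.
If $X \ge 2^{\alpha}$, then $X - 2^{\alpha}$ is uniformly distributed in the set $\{0, \ldots, m - 2^{\alpha} - 1\}$, which
is smaller than the set {0, ..., m - 1}. The extraction function can then recursively
produce the output from the extraction function for the variable $X - 2^{\alpha}$.

The recursive extraction function maintains the property that, for every k, each of the
$2^{k}$ sequences of k bits is output with the same probability.

We claim by induction that the expected number of unbiased, independent bits produced
by this extraction function is at least $\lfloor \log_2 m \rfloor - 1$. The function outputs $\alpha$ bits with
probability $2^{\alpha}/m$ and, by the induction hypothesis, at least $\lfloor \log_2 (m - 2^{\alpha}) \rfloor - 1$ bits on
average with the remaining probability $(m - 2^{\alpha})/m$. The expected number of bits output
is therefore at least

$$\frac{2^{\alpha}}{m}\,\alpha + \frac{m - 2^{\alpha}}{m}\bigl(\lfloor \log_2 (m - 2^{\alpha}) \rfloor - 1\bigr) = \alpha - \frac{m - 2^{\alpha}}{m}\bigl(\alpha - \lfloor \log_2 (m - 2^{\alpha}) \rfloor + 1\bigr).$$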
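Suppose $\lfloor \log_2 (m - 2^{\alpha}) \rfloor = \beta$, where $0 \le \beta \le \alpha - 1$. Then $m - 2^{\alpha} < 2^{\beta + 1}$, so
$(m - 2^{\alpha})/m < 2^{\beta+1}/(2^{\alpha} + 2^{\beta+1})$. Hence the expected number of bits output is at least

$$\alpha - \frac{m - 2^{\alpha}}{m}(\alpha - \beta + 1) \ge \alpha - \frac{1}{2^{\alpha - \beta - 1} + 1}(\alpha - \beta + 1) \ge \alpha - 1,$$

completing the induction. ■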
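
The recursive extraction function from the proof can be sketched in a few lines. This is an illustration under stated assumptions: the names `extract` and `average_bits` are invented here, the output is represented as a Python string of '0' and '1' characters, and the empty string stands for "no bits output".

```python
def extract(x: int, m: int) -> str:
    """Recursive extraction for x uniform on {0, ..., m-1}; returns a bit string."""
    if m <= 1:
        return ""                       # a single possible value yields no bits
    alpha = m.bit_length() - 1          # alpha = floor(log2 m)
    if x < 2 ** alpha:
        return format(x, f"0{alpha}b")  # alpha-bit binary representation of x
    # Otherwise x - 2^alpha is uniform on {0, ..., m - 2^alpha - 1}; recurse.
    return extract(x - 2 ** alpha, m - 2 ** alpha)

def average_bits(m: int) -> float:
    """Expected output length when x is uniform on {0, ..., m-1}."""
    return sum(len(extract(x, m)) for x in range(m)) / m

# Reproduces the second table of Figure 9.2 (inputs 0 through 11).
outputs_for_12 = [extract(x, 12) for x in range(12)]
```

For m = 12 the average output length is (8 * 3 + 4 * 2)/12 = 8/3 bits, comfortably above the guaranteed floor(log2 12) - 1 = 2.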

We use Theorem 9.4 in our proof of the main result of this section. 

Theorem 9.5: Consider a coin that comes up heads with probability p > 1/2. For
any constant $\delta > 0$ and for n sufficiently large:

1. there exists an extraction function Ext that outputs, on an input sequence of n
independent flips, an average of at least $(1 - \delta)nH(p)$ independent random bits; and

2. the average number of bits output by any extraction function Ext on an input
sequence of n independent flips is at most nH(p).

Proof: We begin by describing an extraction function that generates, on average, at
least $(1 - \delta)nH(p)$ random bits from n flips of the biased coin. We saw before that,
in the case of a biased coin, the outcome of n flips is most likely to be one of roughly
$2^{nH(p)}$ sequences, each occurring with probability roughly $2^{-nH(p)}$. If we actually had
a uniform distribution of this type, we could use the extraction function that we have
just described for numbers chosen uniformly at random to obtain on average almost
nH(p) uniform random bits. In what follows, we handle the technical details that arise
because the distribution is not exactly uniform.

There are $\binom{n}{j}$ possible sequences with exactly j heads, and each of them has the
same probability of occurring, $p^{j}(1 - p)^{n-j}$. For each value of j, $0 \le j \le n$, we map
each of the $\binom{n}{j}$ sequences with j heads to a unique integer in the set $\{0, \ldots, \binom{n}{j} - 1\}$.
When j heads come up, we map the sequence to the corresponding number. Conditioned
on there being j heads, this number is uniform on the integers $\{0, \ldots, \binom{n}{j} - 1\}$,
and hence we can apply the extraction function of Theorem 9.4 designed for this case.
Let Z be a random variable representing the number of heads flipped, and let B be
the random variable representing the number of bits our extraction function produces.
Then

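$$\mathbf{E}[B] = \sum_{k=0}^{n} \Pr(Z = k)\, \mathbf{E}[B \mid Z = k]$$

and, by Theorem 9.4,

$$\mathbf{E}[B \mid Z = k] \ge \left\lfloor \log_2 \binom{n}{k} \right\rfloor - 1.$$

Let $\varepsilon < p - 1/2$ represent a constant to be determined. We compute a lower bound
for E[B] by considering only values of k with $n(p - \varepsilon) \le k \le n(p + \varepsilon)$. For every
such k,

$$\binom{n}{k} \ge \binom{n}{\lfloor n(p + \varepsilon) \rfloor} \ge \frac{2^{nH(p + \varepsilon)}}{n + 1},$$

where the last inequality follows from Corollary 9.3. Hence, using $\lfloor x \rfloor - 1 \ge x - 2$,

$$
\begin{aligned}
\mathbf{E}[B] &\ge \sum_{k = \lceil n(p - \varepsilon) \rceil}^{\lfloor n(p + \varepsilon) \rfloor} \Pr(Z = k)\, \mathbf{E}[B \mid Z = k] \\
&\ge \sum_{k = \lceil n(p - \varepsilon) \rceil}^{\lfloor n(p + \varepsilon) \rfloor} \Pr(Z = k) \left( \left\lfloor \log_2 \binom{n}{k} \right\rfloor - 1 \right) \\
&\ge \left( \log_2 \frac{2^{nH(p + \varepsilon)}}{n + 1} - 2 \right) \sum_{k = \lceil n(p - \varepsilon) \rceil}^{\lfloor n(p + \varepsilon) \rfloor} \Pr(Z = k) \\
&= (nH(p + \varepsilon) - \log_2 (n + 1) - 2) \Pr(|Z - np| \le \varepsilon n).
\end{aligned}
$$

Now E[Z] = np, and $\Pr(|Z - np| > \varepsilon n)$ can be bounded by using the Chernoff bound
of Eqn. (4.6), giving

$$\Pr(|Z - np| > \varepsilon n) \le 2e^{-n\varepsilon^{2}/3p}.$$

Hence

$$\mathbf{E}[B] \ge (nH(p + \varepsilon) - \log_2 (n + 1) - 2)\left(1 - 2e^{-n\varepsilon^{2}/3p}\right).$$

We conclude that, for any constant $\delta > 0$, we can have

$$\mathbf{E}[B] \ge (1 - \delta)nH(p)$$

by choosing $\varepsilon$ sufficiently small and n sufficiently large. For example, for sufficiently
small $\varepsilon$,

$$nH(p + \varepsilon) \ge (1 - \delta/3)nH(p),$$

and when $n > (3p/\varepsilon^{2}) \ln(6/\delta)$ we have

$$1 - 2e^{-n\varepsilon^{2}/3p} \ge 1 - \delta/3.$$

Hence, with these choices,

$$\mathbf{E}[B] \ge ((1 - \delta/3)nH(p) - \log_2 (n + 1) - 2)(1 - \delta/3).$$

As long as we now also choose n sufficiently large that $(\delta/3)nH(p)$ is greater than
$\log_2 (n + 1) + 2$, we have

$$\mathbf{E}[B] \ge ((1 - 2\delta/3)nH(p))(1 - \delta/3) \ge (1 - \delta)nH(p),$$

proving there exists an extraction function that can extract $(1 - \delta)nH(p)$ independent
and uniform bits on average from n flips of the biased coin.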
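
The construction in this part of the proof can be sketched concretely. The text only requires some one-to-one map from the sequences with j heads to {0, ..., C(n, j) - 1}; the sketch below assumes one particular choice, the combinatorial number system, and reuses a recursive `extract` in the style of Theorem 9.4. All function names are illustrative, and no attempt is made at efficiency.

```python
import math
from collections import Counter
from itertools import product

def extract(x: int, m: int) -> str:
    """Recursive extraction for x uniform on {0, ..., m-1} (Theorem 9.4 style)."""
    if m <= 1:
        return ""
    alpha = m.bit_length() - 1          # floor(log2 m)
    if x < 2 ** alpha:
        return format(x, f"0{alpha}b")
    return extract(x - 2 ** alpha, m - 2 ** alpha)

def rank_among_same_weight(flips: str) -> int:
    """Map a sequence with j heads to a unique integer in {0, ..., C(n, j) - 1}.

    Uses the combinatorial number system: if the i-th head (counting from 1)
    sits at 0-indexed position c_i, the rank is the sum of C(c_i, i).
    """
    rank, heads_seen = 0, 0
    for position, flip in enumerate(flips):
        if flip == "H":
            heads_seen += 1
            rank += math.comb(position, heads_seen)
    return rank

def extract_from_flips(flips: str) -> str:
    """Extraction function for n biased flips, conditioning on the head count."""
    n, j = len(flips), flips.count("H")
    return extract(rank_among_same_weight(flips), math.comb(n, j))

# Exhaustive check for n = 6: within each head count j, every output string
# is produced by exactly one input sequence.
n = 6
counts = Counter()
for flips in map("".join, product("HT", repeat=n)):
    counts[(flips.count("H"), extract_from_flips(flips))] += 1
uniform_per_length = all(c == 1 for c in counts.values())
```

Because all sequences with the same number of heads are equally likely, producing each k-bit string exactly once per head count means the output is uniform conditioned on its length, which is what Definition 9.2 asks for.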

We now show that no extraction function can obtain more than nH(p) bits on average.
The proof relies on the following basic fact: If an input sequence x occurs
with probability q, then the corresponding output sequence Ext(x) can have at most
$|\mathrm{Ext}(x)| \le \log_2 (1/q)$ bits. This is because all sequences with |Ext(x)| bits would have
probability at least q, so

$$2^{|\mathrm{Ext}(x)|}\, q \le 1,$$

giving the desired bound on Ext(x). Given any extraction function, if B is a random
variable representing the number of bits our extraction function produces on input X, then

$$
\begin{aligned}
\mathbf{E}[B] &= \sum_{x} \Pr(X = x)\, |\mathrm{Ext}(x)| \\
&\le \sum_{x} \Pr(X = x) \log_2 \frac{1}{\Pr(X = x)} \\
&= \mathbf{E}\left[\log_2 \frac{1}{\Pr(X)}\right] \\
&= H(X). \qquad \blacksquare
\end{aligned}
$$


Another natural question to ask is how we can generate biased bits from an unbiased 
coin. This question is partially answered in Exercise 9.11. 


9.4. Compression 


A second way of interpreting the entropy value comes from compression. Again suppose
we have a coin that comes up heads with probability $p > 1/2$ and that we flip
it $n$ times, keeping track of which flips are heads and which flips are tails. We could
represent every outcome by using one bit per flip, with 0 representing heads and 1
representing tails, and use a total of $n$ bits. If we take advantage of the fact that the coin is
biased, we can do better on average. For example, suppose that $p = 3/4$. For a pair of
consecutive flips, we use 0 to represent that both flips were heads, 10 to represent that
the first flip was heads and the second tails, 110 to represent that the first flip was tails
and the second heads, and 111 to represent that both flips were tails. Then the average
number of bits we use per pair of flips is


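$$1 \cdot \frac{9}{16} + 2 \cdot \frac{3}{16} + 3 \cdot \frac{3}{16} + 3 \cdot \frac{1}{16} = \frac{27}{16} < 2.$$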


Hence, on average, we can use less than the 1 bit per flip of the standard scheme by
breaking a sequence of $n$ flips into pairs and representing each pair in the manner shown.
This is an example of compression.
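
As a quick check of the arithmetic above, the following Python sketch computes the
expected codeword length of the pair code for $p = 3/4$ and compares it with the entropy
of two flips, $2H(p)$. The function and variable names are illustrative and do not come
from the text.

```python
from fractions import Fraction
from math import log2

# Codewords for each pair of flips, as given in the text.
PAIR_CODE = {"HH": "0", "HT": "10", "TH": "110", "TT": "111"}

def expected_bits_per_pair(p):
    """Expected number of code bits used for one pair of independent flips."""
    prob = {"H": p, "T": 1 - p}
    return sum(prob[a] * prob[b] * len(code) for (a, b), code in PAIR_CODE.items())

def entropy(p):
    """Binary entropy H(p) in bits, for 0 < p < 1."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

p = Fraction(3, 4)
print(expected_bits_per_pair(p))   # 27/16
print(2 * entropy(float(p)))       # roughly 1.62, the entropy of two flips
```

The pair code uses 27/16 bits per pair, which is below 2 but still above $2H(3/4)$.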

It is worth emphasizing that the representation that we used here has a special property:
if we write the representation of a sequence of flips, it can be uniquely decoded
simply by parsing it from left to right. For example, the sequence

011110


corresponds to two heads, followed by two tails, followed by a heads and a tails. There
is no ambiguity, because no other sequence of flips could produce this output. Our
representation has this property because no bit sequence we use to represent a pair of flips
is the prefix of another bit sequence used in the representation. Representations with
this property are called prefix codes, which are discussed further in Exercise 9.15.

Compression continues to be a subject of considerable study. When storing or transmitting
information, saving bits usually corresponds to saving resources, so finding
ways to reduce the number of used bits by taking advantage of the data’s structure is
often worthwhile.
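
The left-to-right decoding of the pair code from the example above can be sketched in
Python as follows. This is only an illustration of the prefix property; the helper names
`encode` and `decode` are assumptions, not notation from the text.

```python
# Pair code from the example: no codeword is a prefix of another.
PAIR_CODE = {"HH": "0", "HT": "10", "TH": "110", "TT": "111"}

def encode(flips):
    """Encode an even-length string of 'H'/'T' flips pair by pair."""
    if len(flips) % 2 != 0:
        raise ValueError("the pair code needs an even number of flips")
    return "".join(PAIR_CODE[flips[i:i + 2]] for i in range(0, len(flips), 2))

def decode(bits):
    """Decode by scanning left to right and emitting a pair at each codeword."""
    inverse = {code: pair for pair, code in PAIR_CODE.items()}
    flips, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:          # a complete codeword has been read
            flips.append(inverse[current])
            current = ""
    if current:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(flips)

print(decode("011110"))  # HHTTHT
```

Because of the prefix property, the decoder never needs to look ahead or backtrack.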

We consider here the special case of compressing the outcome of a sequence of
biased coin flips. For a biased coin with entropy $H(p)$, we show (a) that the outcome
of $n$ flips of the coin can be represented by approximately $nH(p)$ bits on average and
(b) that approximately $nH(p)$ bits on average are necessary. In particular, any
representation of the outcome of $n$ flips of a fair coin essentially requires $n$ bits. The entropy
is therefore a measure of the average number of bits generated by each coin flip after
compression. This argument can be generalized to any discrete random variable $X$, so
that $n$ independent, identically distributed random variables $X_1, X_2, \ldots, X_n$ with the
same distribution $X$ can be represented using approximately $nH(X)$ bits on average.
In the setting of compression, entropy can be viewed as measuring the amount of
information in the input sequence. The larger the entropy of the sequence, the more bits
are needed in order to represent it.

We begin with a definition that clarifies what we mean by compression in this context.

Definition 9.3: A compression function Com takes as input a sequence of $n$ coin flips,
given as an element of $\{H, T\}^n$, and outputs a sequence of bits such that each input
sequence of $n$ flips yields a distinct output sequence.

Definition 9.3 is rather weak, but it will prove sufficient for our purposes. Usually,
compression functions must satisfy stronger requirements; for example, we may require a
prefix code to simplify decoding. Using this weaker definition makes our lower-bound
proof stronger. Also, though we are not concerned here with the efficiency of
compressing and decompressing procedures, there are very efficient compression schemes
that perform nearly optimally in many situations. We will consider an efficient
compression scheme in Exercise 9.17.

The following theorem formalizes the relationship between the entropy of a biased 
coin and compression. 


Theorem 9.6: Consider a coin that comes up heads with probability $p > 1/2$. For
any constant $\delta > 0$, when $n$ is sufficiently large:

1. there exists a compression function Com such that the expected number of bits
output by Com on an input sequence of $n$ independent coin flips is at most
$(1 + \delta)nH(p)$; and

2. the expected number of bits output by any compression function on an input
sequence of $n$ independent coin flips is at least $(1 - \delta)nH(p)$.


Theorem 9.6 is quite similar to Theorem 9.5. The lower bound on the expected number
of bits output by any compression function is slightly weaker. In fact, we could
raise this lower bound to $nH(p)$ if we insisted that the code be a prefix code (so that
no output is the prefix of any other), but we do not prove this here. The compression
function we design to prove an upper bound on the expected number of output bits does
yield a prefix code. Our construction of this compression function follows roughly the
same intuition as Theorem 9.5. We know that, with high probability, the outcome from
the $n$ flips will be one of roughly $2^{nH(p)}$ sequences with roughly $np$ heads. We can use
about $nH(p)$ bits to represent each one of these sequences, yielding the existence of
an appropriate compression function.

Proof of Theorem 9.6: We first show that there exists a compression function as
guaranteed by the theorem. Let $\varepsilon > 0$ be a suitably small constant with $p - \varepsilon > 1/2$. Let
$X$ be the number of heads in $n$ flips of the coin. The first bit output by the compression
function we use as a flag. We set it to 0 if there are at least $n(p - \varepsilon)$ heads in the
sequence and to 1 otherwise. When the first bit is a 1, the compression function uses the
expensive default scheme, using 1 bit for each of the $n$ flips. This requires that $n + 1$
total bits be output; however, by the Chernoff bound (4.5), the probability that this case
happens is bounded by

$$\Pr(X \leq n(p - \varepsilon)) \leq e^{-n\varepsilon^2/2p}.$$

Now let us consider the case where there are at least $n(p - \varepsilon)$ heads. The number
of coin-flip sequences of this form is

$$\sum_{j=\lceil n(p-\varepsilon)\rceil}^{n} \binom{n}{j} \leq \sum_{j=\lceil n(p-\varepsilon)\rceil}^{n} \binom{n}{\lceil n(p-\varepsilon)\rceil} \leq \frac{n}{2}\, 2^{nH(p-\varepsilon)}.$$

The first inequality arises because the binomial terms are decreasing as long as $j > n/2$,
and the second is a consequence of Corollary 9.3. For each such sequence of coin flips,
the compression function can assign a unique sequence of exactly $\lfloor nH(p - \varepsilon) + \log_2 n \rfloor$
bits to represent it, since

$$2^{\lfloor nH(p-\varepsilon) + \log_2 n \rfloor} \geq \frac{n}{2}\, 2^{nH(p-\varepsilon)}.$$

Including the flag bit, it therefore takes at most $nH(p - \varepsilon) + \log_2 n + 1$ bits to represent
the sequences of coin flips with this many heads.

Totaling these results, we find that the expected number of bits required by the
compression function is at most

$$e^{-n\varepsilon^2/2p}(n + 1) + (1 - e^{-n\varepsilon^2/2p})(nH(p - \varepsilon) + \log_2 n + 2) \leq (1 + \delta)nH(p),$$

where the inequality holds by first taking $\varepsilon$ sufficiently small and then taking $n$
sufficiently large in a manner similar to that of Theorem 9.5.
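
To get a feel for the numbers, the following Python sketch evaluates the expected output
length of a flag-based scheme of this kind for concrete values. It is a variation under
stated assumptions rather than the exact construction in the proof: instead of the bound
$\lfloor nH(p - \varepsilon) + \log_2 n \rfloor$ it uses the exact number of bits needed to index all
sequences with at least $\lceil n(p - \varepsilon) \rceil$ heads, and all names are illustrative.

```python
import math

def entropy(p):
    """Binary entropy H(p) in bits, for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def binom_pmf(n, k, p):
    """Pr(X = k) for X ~ Binomial(n, p), computed in log space to avoid underflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def expected_length(n, p, eps):
    """Expected bits used by the flag scheme: 1 flag bit plus either a
    fixed-length index (many heads) or the n raw flips (few heads)."""
    threshold = math.ceil(n * (p - eps))
    # Number of sequences with at least `threshold` heads.
    count = sum(math.comb(n, j) for j in range(threshold, n + 1))
    index_bits = (count - 1).bit_length()       # ceil(log2(count))
    p_few = sum(binom_pmf(n, k, p) for k in range(threshold))
    return p_few * (n + 1) + (1 - p_few) * (index_bits + 1)

n, p, eps = 1000, 0.75, 0.05
print(expected_length(n, p, eps), n * entropy(p), n)
```

For these values the expected length falls between $nH(p)$ and $n$; shrinking $\varepsilon$ while
growing $n$ moves it closer to $nH(p)$.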

We now show the lower bound. To begin, recall that the probability that a specific
sequence with $k$ heads is flipped is $p^k(1 - p)^{n-k}$. Because $p > 1/2$, if sequence $S_1$ has
more heads than another sequence $S_2$, then $S_1$ is more likely to appear than $S_2$. Also,
we have the following lemma.

Lemma 9.7: If sequence $S_1$ is more likely than $S_2$, then the compression function that
minimizes the expected number of bits in the output assigns a bit sequence to $S_2$ that is
at least as long as the one it assigns to $S_1$.


Proof: Suppose that a compression function assigns a bit sequence to $S_2$ that is shorter
than the bit sequence it assigns to $S_1$. We can improve the expected number of bits
output by the compression function by switching the output sequences associated with $S_1$
and $S_2$, and therefore this compression function is not optimal. □


Hence sequences with more heads should get shorter strings from an optimal
compression function.

We also make use of the following simple fact. If the compression function assigns
distinct sequences of bits to represent each of $s$ coin-flip sequences, then one of the
output bit sequences for the $s$ input sequences must have length at least $\log_2 s - 1$ bits.
This is because there are at most $1 + 2 + 4 + \cdots + 2^b = 2^{b+1} - 1$ distinct bit sequences
with up to $b$ bits, so if each of $s$ sequences of coin flips is assigned a bit sequence of at
most $b$ bits, then we must have $2^{b+1} > s$ and hence $b > \log_2 s - 1$.

Fix a suitably small $\varepsilon > 0$ and count the number of input sequences that have
$\lfloor (p + \varepsilon)n \rfloor$ heads. There are $\binom{n}{\lfloor (p+\varepsilon)n \rfloor}$ sequences with $\lfloor (p + \varepsilon)n \rfloor$ heads and, by
Corollary 9.3,

$$\binom{n}{\lfloor (p+\varepsilon)n \rfloor} \geq \frac{2^{nH(p+\varepsilon)}}{n+1}.$$

Hence any compression function must output at least $\log_2(2^{nH(p+\varepsilon)}/(n+1)) - 1 =
nH(p + \varepsilon) - \log_2(n + 1) - 1$ bits on at least one of the sequences of coin flips with
$\lfloor (p + \varepsilon)n \rfloor$ heads. The compression function that minimizes the expected output length
must therefore use at least this many bits to represent any sequence with fewer heads,
by Lemma 9.7.

By the Chernoff bound (4.2), the number of heads $X$ satisfies

$$\Pr(X \geq \lfloor n(p + \varepsilon) \rfloor) \leq \Pr(X \geq n(p + \varepsilon - 1/n)) \leq e^{-n(\varepsilon - 1/n)^2/3p} \leq e^{-n\varepsilon^2/12p}$$

as long as $n$ is sufficiently large (specifically, $n > 2/\varepsilon$). We thus obtain, with
probability at least $1 - e^{-n\varepsilon^2/12p}$, an input sequence with fewer than $\lfloor n(p + \varepsilon) \rfloor$ heads, and
by our previous reasoning the compression function that minimizes the expected
output length must still output at least $nH(p + \varepsilon) - \log_2(n + 1) - 1$ bits in this case. The
expected number of output bits is therefore at least

$$(1 - e^{-n\varepsilon^2/12p})(nH(p + \varepsilon) - \log_2(n + 1) - 1).$$

This can be made to be at least $(1 - \delta)nH(p)$ by first taking $\varepsilon$ to be sufficiently small
and then taking $n$ to be sufficiently large. ■


9.5.* Coding: Shannon’s Theorem 

We have seen how compression can reduce the expected number of bits required to
represent data by changing the representation of the data. Coding also changes the
representation of the data. Instead of reducing the number of bits required to represent
the data, however, coding adds redundancy in order to protect the data against loss
or errors.



In coding theory, we model the information being passed from a sender to a
receiver through a channel. The channel may introduce noise, distorting the value of
some of the bits during the transmission. The channel can be a wired connection, a
wireless connection, or a storage network. For example, if I store data on a recordable
medium and later try to read it back, then I am both the sender and the receiver, and
the storage medium acts as the channel. In this section, we focus on one specific type
of channel.


Definition 9.4: The input to a binary symmetric channel with parameter $p$ is a
sequence of bits $x_1, x_2, \ldots$, and the output is a sequence of bits $y_1, y_2, \ldots$, such that
$\Pr(x_i = y_i) = 1 - p$ independently for each $i$. Informally, each bit sent is flipped to
the wrong value independently with probability $p$.
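
A binary symmetric channel as in Definition 9.4 is easy to simulate. The sketch below is
a minimal illustration; the function name and the optional `rng` argument are assumptions
made here for reproducibility, not part of the definition.

```python
import random

def binary_symmetric_channel(bits, p, rng=None):
    """Flip each bit independently with probability p and return the result."""
    rng = rng or random.Random()
    # rng.random() < p is True with probability p; XOR with it flips the bit.
    return [bit ^ (rng.random() < p) for bit in bits]

sent = [0, 1, 1, 0, 1, 0, 0, 1]
received = binary_symmetric_channel(sent, p=0.1, rng=random.Random(1))
errors = sum(s != r for s, r in zip(sent, received))
print(received, errors)
```

With `p = 0` the output equals the input, and with `p = 1` every bit is flipped.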

To get useful information out of the channel, we may introduce redundancy to help
protect against the introduction of errors. As an extreme example, suppose the sender
wants to send the receiver a single bit of information over a binary symmetric channel.
To protect against the possibility of error, the sender and receiver agree to repeat
the bit $n$ times. If $p < 1/2$, a natural decoding scheme for the receiver is to look at
the $n$ bits received and decide that the value that was received more frequently is the
bit value the sender intended. The larger $n$ is, the more likely the receiver determines
the correct bit; by repeating the bit enough times, the probability of error can be made
arbitrarily small. This example is considered more extensively in Exercise 9.18.
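
The effect of repetition can be computed exactly. The following sketch assumes an odd
number of repetitions $n$ (so that majority voting never ties) and evaluates the probability
that more than half of the $n$ copies are flipped; the function name is illustrative.

```python
from math import comb

def majority_error_probability(n, p):
    """Probability that majority decoding of n repeated bits is wrong,
    i.e. that more than n/2 of the copies are flipped by the channel."""
    if n % 2 == 0:
        raise ValueError("use an odd n so that a majority always exists")
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range((n + 1) // 2, n + 1))

for n in (1, 3, 5, 11, 21):
    print(n, majority_error_probability(n, 0.1))
```

For $p = 0.1$ the error probability drops from 0.1 with a single copy to 0.028 with three
copies, and keeps shrinking as $n$ grows, at the cost of sending only one message bit per
$n$ transmitted bits.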

Coding theory studies the trade-off between the amount of redundancy required and
the probability of a decoding error over various types of channels. For the binary
symmetric channel, simply repeating bits may not be the best use of redundancy. Instead
we consider more general encoding functions.


Definition 9.5: A $(k,n)$ encoding function $\mathrm{Enc} : \{0,1\}^k \to \{0,1\}^n$ takes as input
a sequence of $k$ bits and outputs a sequence of $n$ bits. A $(k,n)$ decoding function
$\mathrm{Dec} : \{0,1\}^n \to \{0,1\}^k$ takes as input a sequence of $n$ bits and outputs a sequence of
$k$ bits.


With coding, the sender takes a $k$-bit message and encodes it into a block of $n \geq k$ bits
via the encoding function. These bits are then sent over the channel. The receiver
examines the $n$ bits received and attempts to determine the original $k$-bit message using
the decoding function.

Given a binary channel with parameter $p$ and a target encoding length of $n$, we wish
to determine the largest value of $k$ so that there exist $(k, n)$ encoding and decoding
functions with the property that, for any input sequence of $k$ bits, with suitably large
probability the receiver decodes the correct input from the corresponding $n$-bit
encoding sequence after it has been distorted by the channel.

Let $m \in \{0,1\}^k$ be the message to be sent and Enc(m) the sequence of bits sent over
the channel. Let the random variable $X$ denote the sequence of received bits. We
require that $\mathrm{Dec}(X) = m$ with probability at least $1 - \gamma$ for all possible messages $m$
and a pre-chosen constant $\gamma$. If there were no noise, then we could send the original $k$
bits over the channel. The noise reduces the information that the receiver can extract
from each bit sent, so that the sender can reliably send only about
$n(1 - H(p))$ bits within each block of $n$ bits. This result is known as Shannon's
theorem, which we prove in the following form.

Theorem 9.8: For a binary symmetric channel with parameter $p < 1/2$ and for any
constants $\delta, \gamma > 0$, when $n$ is sufficiently large:

1. for any $k \leq n(1 - H(p) - \delta)$, there exist $(k, n)$ encoding and decoding functions
such that the probability the receiver fails to obtain the correct message is at most
$\gamma$ for every possible $k$-bit input message; and

2. there are no $(k, n)$ encoding and decoding functions with $k \geq n(1 - H(p) + \delta)$
such that the probability of decoding correctly is at least $\gamma$ for a $k$-bit input message
chosen uniformly at random.

Proof: We first prove the existence of suitable $(k, n)$ encoding and decoding functions
when $k \leq n(1 - H(p) - \delta)$ by using the probabilistic method. In the end, we want our
encoding and decoding functions to have error probability at most $\gamma$ on every possible
input. We begin with a weaker result, showing that there exist appropriate coding
functions when the input is chosen uniformly at random from all $k$-bit inputs.

The encoding function assigns to each of the $2^k$ strings an $n$-bit codeword chosen
independently and uniformly at random from the space of all $n$-bit sequences. Label
these codewords $X_0, X_1, \ldots, X_{2^k-1}$. The encoding function simply outputs the
codeword assigned to the $k$-bit message using a large lookup table containing an entry for
each $k$-bit string. (You may be concerned that two codewords may turn out to be the
same; the probability of this is very small and is handled in the analysis that follows.)

To describe the decoding function, we provide a decoding algorithm based on the
lookup table for the encoding function, which we may assume the receiver possesses.
The decoding algorithm makes use of the fact that the receiver expects the channel to
make roughly $pn$ errors. The receiver therefore looks for a codeword that differs from
the $n$ bits received in between $(p - \varepsilon)n$ and $(p + \varepsilon)n$ places for some suitably small
constant $\varepsilon > 0$. If just one codeword has this property, then the receiver will assume
that this was the codeword sent and will recover the message accordingly. If more than
one codeword has this property, the decoding algorithm fails. The decoding algorithm
described here requires exponential time and space. As in the rest of this chapter, we
are not now concerned with efficiency issues.
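
The random codebook and the distance-based decoding rule just described can be sketched
as follows. This is a brute-force illustration for very small $k$ and $n$ only; the names
`random_codebook` and `decode`, and the choice to return `None` on failure, are assumptions
of this sketch.

```python
import random

def hamming_distance(a, b):
    """Number of positions in which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def random_codebook(k, n, rng):
    """Assign an independent, uniformly random n-bit codeword to each of the
    2**k messages; entry i is the codeword for message number i."""
    return [tuple(rng.randrange(2) for _ in range(n)) for _ in range(2 ** k)]

def decode(received, codebook, p, eps):
    """Return the index of the unique codeword whose distance from `received`
    lies in [(p - eps) n, (p + eps) n], or None if there is no such unique
    codeword (decoding failure)."""
    n = len(received)
    low, high = (p - eps) * n, (p + eps) * n
    candidates = [i for i, codeword in enumerate(codebook)
                  if low <= hamming_distance(codeword, received) <= high]
    return candidates[0] if len(candidates) == 1 else None

# Tiny hand-built codebook: one error moves the first codeword into range.
codebook = [(0,) * 10, (1,) * 10]
print(decode((1,) + (0,) * 9, codebook, p=0.1, eps=0.05))  # 0
```

Note that a received sequence with too few errors also fails to decode under this rule,
exactly as in the analysis, where any error count outside the range is charged as a failure.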

The corresponding $(k, n)$ decoding function can be obtained from the algorithm by
simply running through all possible $n$-bit sequences. Whenever a sequence decodes
properly with the foregoing algorithm, the output of the decoding function for that
sequence is set to the $k$-bit sequence associated with the corresponding codeword.
Whenever the algorithm fails, the output for the sequence can be any arbitrary sequence of
$k$ bits. For the decoding function to fail, at least one of the two following events must
occur:

• the channel does not make between $(p - \varepsilon)n$ and $(p + \varepsilon)n$ errors; or

• when a codeword is sent, the received sequence differs from some other codeword
in between $(p - \varepsilon)n$ and $(p + \varepsilon)n$ places.



The path of the proof is now clear. A Chernoff bound can be used to show that, with
high probability, the channel does not make too few or too many errors. Conditioning
on the number of errors being neither too few nor too many, the question becomes how
large $k$ can be while ensuring that, with the required probability, the received sequence
does not differ from multiple codewords in between $(p - \varepsilon)n$ and $(p + \varepsilon)n$ places.

Now that we have described the encoding and decoding functions, we establish the
notation to be used in the analysis. Let $R$ be the received sequence of bits. For
sequences $s_1$ and $s_2$ of $n$ bits, we write $\Delta(s_1, s_2)$ for the number of positions where these
sequences differ. This value $\Delta(s_1, s_2)$ is referred to as the Hamming distance between
the two strings. We say that the pair $(s_1, s_2)$ has weight

$$w(s_1, s_2) = p^{\Delta(s_1, s_2)}(1 - p)^{n - \Delta(s_1, s_2)}.$$

The weight corresponds to the probability that $s_2$ is received when $s_1$ is sent over the
channel. We introduce random variables $S_0, S_1, \ldots, S_{2^k-1}$ and $W_0, W_1, \ldots, W_{2^k-1}$
defined as follows. The set $S_i$ is the set of all received sequences that decode to $X_i$. The
value $W_i$ is given by

$$W_i = \sum_{r \notin S_i} w(X_i, r).$$

The $S_i$ and $W_i$ are random variables that depend only on the random choices of
$X_0, X_1, \ldots, X_{2^k-1}$. The variable $W_i$ represents the probability that, when $X_i$ is sent,
the received sequence $R$ does not lie in $S_i$ and hence is decoded incorrectly. It is also
helpful to express $W_i$ in the following way: letting $I_{i,s}$ be an indicator random variable
that is 1 if $s \notin S_i$ and 0 otherwise, we can write

$$W_i = \sum_{r} I_{i,r}\, w(X_i, r).$$
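
For very small block lengths, the weight function and the failure probability $W_i$ can be
evaluated by brute force. The sketch below is an illustration under the assumption that
the decoding set $S_i$ is supplied explicitly as a Python set of tuples; the function names
are not from the text.

```python
from itertools import product

def hamming_distance(s1, s2):
    """Number of positions in which the two sequences differ."""
    return sum(a != b for a, b in zip(s1, s2))

def weight(s1, s2, p):
    """Probability that s2 is received when s1 is sent over the channel."""
    d = hamming_distance(s1, s2)
    return p ** d * (1 - p) ** (len(s1) - d)

def failure_probability(codeword, decoding_set, p):
    """W_i: total weight of all received sequences outside the decoding set."""
    n = len(codeword)
    return sum(weight(codeword, r, p)
               for r in product((0, 1), repeat=n)
               if r not in decoding_set)

codeword = (0, 0, 0)
# Decode to this codeword whenever at most one bit was flipped.
decoding_set = {r for r in product((0, 1), repeat=3) if sum(r) <= 1}
print(failure_probability(codeword, decoding_set, 0.1))
```

In this toy case the result is the probability of two or more flips among three bits,
about 0.028 for $p = 0.1$.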

We start by bounding $E[W_i]$. By symmetry, $E[W_i]$ is the same for all $i$, so we bound
$E[W_0]$. Now

$$E[W_0] = E\Bigl[\sum_{r} I_{0,r}\, w(X_0, r)\Bigr] = \sum_{r} E[w(X_0, r) I_{0,r}].$$

We split the sum into two parts. Let $T_1 = \{s : |\Delta(X_0, s) - pn| > \varepsilon n\}$ and
$T_2 = \{s : |\Delta(X_0, s) - pn| \leq \varepsilon n\}$, where $\varepsilon > 0$ is some constant to be determined. Then

$$\sum_{r} E[w(X_0, r) I_{0,r}] = \sum_{r \in T_1} E[w(X_0, r) I_{0,r}] + \sum_{r \in T_2} E[w(X_0, r) I_{0,r}],$$

and we bound each term.

We first bound

$$
\begin{aligned}
\sum_{r \in T_1} E[w(X_0, r) I_{0,r}] &\leq \sum_{r \in T_1} w(X_0, r) \\
&= \sum_{r : |\Delta(X_0, r) - pn| > \varepsilon n} p^{\Delta(X_0, r)}(1 - p)^{n - \Delta(X_0, r)} \\
&= \Pr(|\Delta(X_0, R) - pn| > \varepsilon n).
\end{aligned}
$$

That is, to bound the first term, we simply bound the probability that the receiver fails
to decode correctly and the number of errors is not in the range $[(p - \varepsilon)n, (p + \varepsilon)n]$ by
the probability that the number of errors is not in this range. Equivalently, we obtain
our bound by assuming that, whenever there are too many or too few errors introduced
by the channel, we fail to decode correctly. This probability is very small, as we can
see by using the Chernoff bound (4.6):

$$\Pr(|\Delta(X_0, R) - pn| > \varepsilon n) \leq 2e^{-\varepsilon^2 n/3p}.$$

For any $\varepsilon > 0$, we can choose $n$ sufficiently large so that this probability, and hence
$\sum_{r \in T_1} E[w(X_0, r) I_{0,r}]$, is less than $\gamma/2$.

We now find an upper bound for $\sum_{r \in T_2} E[w(X_0, r) I_{0,r}]$. For every $r \in T_2$, the
decoding algorithm will be successful when $r$ is received unless $r$ differs from some other
codeword $X_i$ in between $(p - \varepsilon)n$ and $(p + \varepsilon)n$ places. Hence $I_{0,r}$ will be 1 only if
such an $X_i$ exists, and thus for any values of $X_0$ and $r \in T_2$ we have

$$E[w(X_0, r) I_{0,r}] \leq w(X_0, r) \Pr(\text{for some } X_i \text{ with } 1 \leq i \leq 2^k - 1,\ |\Delta(X_i, r) - pn| \leq \varepsilon n).$$

It follows that if we obtain an upper bound

$$\Pr(\text{for some } X_i \text{ with } 1 \leq i \leq 2^k - 1,\ |\Delta(X_i, r) - pn| \leq \varepsilon n) \leq \frac{\gamma}{2}$$

for any values of $X_0$ and $r \in T_2$, then

$$\sum_{r \in T_2} E[w(X_0, r) I_{0,r}] \leq \sum_{r \in T_2} w(X_0, r) \frac{\gamma}{2} \leq \frac{\gamma}{2}.$$

To obtain this upper bound, we recall that the other codewords $X_1, X_2, \ldots, X_{2^k-1}$ are
chosen independently and uniformly at random. The probability that any other specific
codeword $X_i$, $i > 0$, differs from any given string $r$ of length $n$ in between $(p - \varepsilon)n$
and $(p + \varepsilon)n$ places is therefore at most

$$\sum_{j=\lceil n(p-\varepsilon)\rceil}^{\lfloor n(p+\varepsilon)\rfloor} \frac{\binom{n}{j}}{2^n} \leq n\,\frac{\binom{n}{\lfloor n(p+\varepsilon)\rfloor}}{2^n}.$$

Here we have bounded the summation by $n$ times its largest term: $\binom{n}{j}$ is largest when
$j = \lfloor n(p + \varepsilon) \rfloor$ over the range of $j$ in the summation, as long as $\varepsilon$ is chosen so that
$p + \varepsilon < 1/2$.

Using Corollary 9.3,

$$\frac{\binom{n}{\lfloor n(p+\varepsilon)\rfloor}}{2^n} \leq \frac{2^{nH(p+\varepsilon)}}{2^n} = 2^{-n(1 - H(p+\varepsilon))}.$$

Hence the probability that any specific $X_i$ matches a string $r$ on a number of bits
so as to cause a decoding failure is at most $n2^{-n(1 - H(p+\varepsilon))}$. By a union bound, the
probability that any of the $2^k - 1$ other codewords cause a decoding failure when $X_0$ is
sent is at most

$$n2^{-n(1 - H(p+\varepsilon))}(2^k - 1) \leq n2^{n(H(p+\varepsilon) - H(p) - \delta)},$$

where we have used the fact that $k \leq n(1 - H(p) - \delta)$. By choosing $\varepsilon$ small enough
so that $H(p + \varepsilon) - H(p) - \delta$ is negative and then choosing $n$ sufficiently large, we
can make this term as small as desired, and in particular we can make it less than $\gamma/2$.

By summing the bounds over the two sets $T_1$ and $T_2$, which correspond to the two
types of error in the decoding algorithm, we find that $E[W_0] \leq \gamma$.

We can bootstrap this result to show that there exists a specific code such that, if
the message to be sent is chosen uniformly at random, then the code fails with
probability at most $\gamma$. We use the linearity of expectations and the probabilistic method. We
have that

$$\sum_{j=0}^{2^k-1} E[W_j] = E\Bigl[\sum_{j=0}^{2^k-1} W_j\Bigr] \leq 2^k\gamma,$$

where again the expectation is over the random choices of the codewords $X_0, X_1, \ldots,
X_{2^k-1}$. By the probabilistic method, there must exist a specific set of codewords
$x_0, x_1, \ldots, x_{2^k-1}$ such that

$$\sum_{j=0}^{2^k-1} W_j \leq 2^k\gamma.$$

When a $k$-bit message to be sent is chosen uniformly at random, the probability of
error is

$$\frac{1}{2^k}\sum_{j=0}^{2^k-1} W_j \leq \gamma$$

for this set of codewords, proving the claim.

We now prove the stronger statement in the theorem: we can choose the codewords 
so that the probability of failure for each individual codeword is bounded above by y. 
Notice that this is not implied by the previous analysis, which simply shows that the 
average probability of failure over the codewords is bounded above by y. 

We have shown that there exists a set of codewords $x_0, x_1, \ldots, x_{2^k-1}$ for which

$$\sum_{j=0}^{2^k-1} W_j \leq 2^k\gamma.$$

Without loss of generality, let us assume that the $x_i$ are sorted in increasing order of
$W_i$. Suppose that we remove the half of the codewords that have the largest values $W_i$;
that is, we remove the codewords that have the highest probability of yielding an error
when being sent. We claim that each $x_i$, $i < 2^{k-1}$, must satisfy $W_i \leq 2\gamma$. Otherwise
we would have

$$\sum_{j=2^{k-1}}^{2^k-1} W_j > 2^{k-1}(2\gamma) = 2^k\gamma,$$

a contradiction. (We used similar reasoning in the proof of Markov’s inequality in
Section 3.1.)
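
The step of discarding the worse half of the codewords can be illustrated numerically.
The sketch below uses made-up error probabilities (purely hypothetical values, not taken
from any real code) and checks that every surviving codeword has error probability at
most twice the average.

```python
def expurgate(error_probs):
    """Return the indices of the half of the codewords with the smallest
    error probabilities W_j."""
    order = sorted(range(len(error_probs)), key=lambda j: error_probs[j])
    return order[: len(order) // 2]

# Hypothetical W_j values for a code with 2**3 codewords.
W = [0.01, 0.30, 0.02, 0.05, 0.00, 0.10, 0.04, 0.08]
gamma = sum(W) / len(W)          # average error probability
kept = expurgate(W)

print(sorted(kept))
print(all(W[j] <= 2 * gamma for j in kept))   # True
```

The final check can never fail: if a kept codeword exceeded $2\gamma$, then so would every
discarded one, and the average would exceed $\gamma$.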

We can set up new encoding and decoding functions on all $(k - 1)$-bit strings using
just these $2^{k-1}$ codewords, and now the error probability for every codeword is
simultaneously at most $2\gamma$. Hence we have shown that, when $k - 1 \leq n(1 - H(p) - \delta)$,
there exists a code such that the probability that the receiver fails to obtain the correct
message is at most $2\gamma$ for any message that is sent. Since $\delta$ and $\gamma$ were arbitrary
constants, let $\gamma' = 2\gamma$ and $\delta' = \delta/2$. Then we have $k - 1 \leq n(1 - H(p) - 2\delta')$, which
implies that $k \leq n(1 - H(p) - \delta')$ when $n$ is sufficiently large, and so the probability that
the decoding fails for any individual codeword is bounded by $\gamma'$. This is exactly the
statement of the first half of the theorem, with $\gamma'$ and $\delta'$ in place of $\gamma$ and $\delta$, and hence
this part of the theorem has been proven.

Having completed the first half of the theorem, we now move to the second half: for
any constants $\delta, \gamma > 0$ and for $n$ sufficiently large, there do not exist $(k, n)$ encoding
and decoding functions with $k \geq n(1 - H(p) + \delta)$ such that the probability of decoding
correctly is at least $\gamma$ for a $k$-bit input message chosen uniformly at random.

Before giving the proof, let us first consider some helpful intuition. We know that
the number of errors introduced by the channel is, with high probability, between
$\lceil (p - \varepsilon)n \rceil$ and $\lfloor (p + \varepsilon)n \rfloor$ for a suitable constant $\varepsilon > 0$. Suppose that we try to set up
the decoding function so that each codeword is decoded properly whenever the number
of errors is between $(p - \varepsilon)n$ and $(p + \varepsilon)n$. Then each codeword is associated
with

$$\sum_{j=\lceil n(p-\varepsilon)\rceil}^{\lfloor n(p+\varepsilon)\rfloor} \binom{n}{j} \geq \binom{n}{\lceil np \rceil} \geq \frac{2^{nH(p)}}{n+1}$$

bit sequences by the decoding function; the last inequality follows from Corollary 9.3.
But there are $2^k$ different codewords, and

$$2^k\,\frac{2^{nH(p)}}{n+1} \geq 2^{n(1 - H(p) + \delta)}\,\frac{2^{nH(p)}}{n+1} > 2^n$$

when $n$ is sufficiently large. Since there are only $2^n$ possible bit sequences that can be
received, we cannot create a decoding function that always decodes properly whenever
the number of errors is between $(p - \varepsilon)n$ and $(p + \varepsilon)n$.

We now need to extend the argument for any encoding and decoding functions. This
argument is more complex, since we cannot assume that the decoding function
necessarily tries to decode properly whenever the number of errors is between $(p - \varepsilon)n$ and
$(p + \varepsilon)n$, even though this would seem to be the best strategy to pursue.

Given any fixed encoding function with codewords $x_0, x_1, \ldots, x_{2^k-1}$ and any fixed
decoding function, let $z$ be the probability of successful decoding. Define $S_i$ to be the
set of all received sequences that decode to $x_i$. Then

$$
\begin{aligned}
z &= \sum_{i=0}^{2^k-1} \sum_{s \in S_i} \Pr((x_i \text{ is sent}) \cap (R = s)) \\
&= \sum_{i=0}^{2^k-1} \sum_{s \in S_i} \Pr(x_i \text{ is sent}) \Pr(R = s \mid x_i \text{ is sent}) \\
&= \frac{1}{2^k} \sum_{i=0}^{2^k-1} \sum_{s \in S_i} \Pr(R = s \mid x_i \text{ is sent}) \\
&= \frac{1}{2^k} \sum_{i=0}^{2^k-1} \sum_{s \in S_i} w(x_i, s).
\end{aligned}
$$

The second line follows from the definition of conditional probability. The third line
uses the fact that the message sent and hence the codeword sent is chosen uniformly at
random from all codewords. The fourth line is just the definition of the weight
function.

To bound this last line, we again split the summation $\sum_{s \in S_i} w(x_i, s)$ into two
parts. Let $S_{i,1} = \{s \in S_i : |\Delta(x_i, s) - pn| > \varepsilon n\}$ and $S_{i,2} = \{s \in S_i : |\Delta(x_i, s) - pn| \leq
\varepsilon n\}$, where again $\varepsilon > 0$ is some constant to be determined. Then

$$\sum_{s \in S_i} w(x_i, s) = \sum_{s \in S_{i,1}} w(x_i, s) + \sum_{s \in S_{i,2}} w(x_i, s).$$

Now

$$\sum_{s \in S_{i,1}} w(x_i, s) \leq \sum_{s : |\Delta(x_i, s) - pn| > \varepsilon n} w(x_i, s),$$

which can be bounded using Chernoff bounds. The summation on the right is simply
the probability that the number of errors introduced by the channel is not between
$(p - \varepsilon)n$ and $(p + \varepsilon)n$, which we know from previous arguments is at most $2e^{-\varepsilon^2 n/3p}$.
This bound is equivalent to assuming that decoding is successful even if there are too
many or too few errors introduced by the channel; since the probability of too many or
too few errors is small, this assumption still yields a good bound.

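To bound $\sum_{s \in S_{i,2}} w(x_i, s)$, we note that $w(x_i, s)$ is decreasing in $\Delta(x_i, s)$. Hence,
for $s \in S_{i,2}$,

$$
\begin{aligned}
w(x_i, s) &\leq p^{(p - \varepsilon)n}(1 - p)^{(1 - p + \varepsilon)n} \\
&= p^{pn}(1 - p)^{(1-p)n}\left(\frac{1-p}{p}\right)^{\varepsilon n} \\
&= 2^{-H(p)n}\left(\frac{1-p}{p}\right)^{\varepsilon n}.
\end{aligned}
$$

Therefore,

$$\sum_{s \in S_{i,2}} w(x_i, s) \leq 2^{-H(p)n}\left(\frac{1-p}{p}\right)^{\varepsilon n} |S_{i,2}|.$$

We continue with

$$
\begin{aligned}
z &= \frac{1}{2^k} \sum_{i=0}^{2^k-1} \sum_{s \in S_i} w(x_i, s) \\
&= \frac{1}{2^k} \sum_{i=0}^{2^k-1} \Bigl( \sum_{s \in S_{i,1}} w(x_i, s) + \sum_{s \in S_{i,2}} w(x_i, s) \Bigr) \\
&\leq \frac{1}{2^k} \sum_{i=0}^{2^k-1} \Bigl( 2e^{-\varepsilon^2 n/3p} + 2^{-H(p)n}\left(\frac{1-p}{p}\right)^{\varepsilon n} |S_{i,2}| \Bigr) \\
&= 2e^{-\varepsilon^2 n/3p} + \frac{1}{2^k}\, 2^{-H(p)n}\left(\frac{1-p}{p}\right)^{\varepsilon n} \sum_{i=0}^{2^k-1} |S_{i,2}| \\
&\leq 2e^{-\varepsilon^2 n/3p} + \frac{1}{2^k}\, 2^{-H(p)n}\left(\frac{1-p}{p}\right)^{\varepsilon n} 2^n.
\end{aligned}
$$

In this last line, we have used the important fact that the sets of bit sequences $S_i$ and
hence all the $S_{i,2}$ are disjoint, so their total size is at most $2^n$. This is where the fact
that we are using a decoding function comes into play, allowing us to establish a useful
bound.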

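To conclude,

$$
\begin{aligned}
z &\leq 2e^{-\varepsilon^2 n/3p} + 2^{n - (1 - H(p) + \delta)n - H(p)n}\left(\frac{1-p}{p}\right)^{\varepsilon n} \\
&= 2e^{-\varepsilon^2 n/3p} + \left(\left(\frac{1-p}{p}\right)^{\varepsilon} 2^{-\delta}\right)^{n}.
\end{aligned}
$$

As long as we choose $\varepsilon$ sufficiently small that

$$\left(\frac{1-p}{p}\right)^{\varepsilon} 2^{-\delta} < 1,$$

then, when $n$ is sufficiently large, $z < \gamma$, which proves Theorem 9.8.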


Shannon’s theorem demonstrates the existence of codes that transmit arbitrarily closely
to the capacity of the binary symmetric channel over long enough blocks. It does not
give explicit codes, nor does it say that such codes can be encoded and decoded
efficiently. It took decades after Shannon’s original work before practical codes with
near-optimal performance were found.
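
To make the rate $1 - H(p)$ from Theorem 9.8 concrete, the following sketch computes it
for a given channel parameter, together with the largest message length $k$ allowed by
part 1 of the theorem for a given block length $n$ and slack $\delta$. The function names are
illustrative assumptions.

```python
from math import floor, log2

def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_rate(p):
    """The quantity 1 - H(p) for a binary symmetric channel with parameter p."""
    return 1 - binary_entropy(p)

def max_message_bits(n, p, delta):
    """Largest k with k <= n (1 - H(p) - delta), as in part 1 of Theorem 9.8."""
    return max(0, floor(n * (bsc_rate(p) - delta)))

for p in (0.01, 0.05, 0.11, 0.25):
    print(p, bsc_rate(p), max_message_bits(1000, p, 0.05))
```

For example, at $p = 0.11$ the rate is very close to 1/2, so roughly half of the bits in a
long block can carry message information.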


9.6. Exercises 

Exercise 9.1: (a) Let $S = \sum_{k=1}^{10} 1/k^2$. Consider a random variable $X$ such that
$\Pr(X = k) = 1/(Sk^2)$ for integers $k = 1, \ldots, 10$. Find $H(X)$.

(b) Let $S = \sum_{k=1}^{10} 1/k^3$. Consider a random variable $X$ such that $\Pr(X = k) =
1/(Sk^3)$ for integers $k = 1, \ldots, 10$. Find $H(X)$.

(c) Consider $S_\alpha = \sum_{k=1}^{10} 1/k^\alpha$, where $\alpha > 1$ is a constant. Consider random
variables $X_\alpha$ such that $\Pr(X_\alpha = k) = 1/(S_\alpha k^\alpha)$ for integers $k = 1, \ldots, 10$. Give an intuitive
explanation of whether $H(X_\alpha)$ is increasing or decreasing with $\alpha$ and why.


Exercise 9.2: Consider an $n$-sided die, where the $i$th face comes up with probability
$p_i$. Show that the entropy of a die roll is maximized when each face comes up with
equal probability $1/n$.


Exercise 9.3: (a) A fair coin is repeatedly flipped until the first heads occurs. Let $X$
be the number of flips required. Find $H(X)$.

(b) Your friend flips a fair coin repeatedly until the first heads occurs. You want to
determine how many flips were required. You are allowed to ask a series of yes-no
questions of the following form: you give your friend a set of integers, and your friend
answers “yes” if the number of flips is in that set and “no” otherwise. Describe a
strategy so that the expected number of questions you must ask before determining the
number of flips is $H(X)$.

(c) Give an intuitive explanation of why you cannot come up with a strategy that
would allow you to ask fewer than $H(X)$ questions on average.

Exercise 9.4: (a) Show that

$$S = \sum_{k=2}^{\infty} \frac{1}{k \ln^2 k}$$

is finite.

(b) Consider the integer-valued discrete random variable $X$ given by

$$\Pr(X = k) = \frac{1}{S k \ln^2 k}, \qquad k \geq 2.$$

Show that $H(X)$ is unbounded.


Exercise 9.5: Suppose $p$ is chosen uniformly at random from the real interval $[0,1]$.
Calculate $E[H(p)]$.

Exercise 9.6: The conditional entropy $H(Y \mid X)$ is defined by

$$H(Y \mid X) = -\sum_{x,y} \Pr((X = x) \cap (Y = y)) \log_2 \Pr(Y = y \mid X = x).$$

If $Z = (X, Y)$, show that

$$H(Z) = H(X) + H(Y \mid X).$$


Exercise 9.7: One form of Stirling’s formula is 



< «! < 





Using this, prove


which is a tighter bound than that of Lemma 9.2.

Exercise 9.8: We have shown in Theorem 9.5 that we can use a recursive procedure
to extract, on average, at least $\lfloor \log_2 m \rfloor - 1$ independent, unbiased bits from a number
$X$ chosen uniformly at random from $S = \{0, \ldots, m - 1\}$. Consider the following
extraction function: let $\alpha = \lfloor \log_2 m \rfloor$, and write

$$m = \beta_\alpha 2^\alpha + \beta_{\alpha-1} 2^{\alpha-1} + \cdots + \beta_0 2^0,$$

where each $\beta_i$ is either 0 or 1.

Let $k$ be the number of values of $i$ for which $\beta_i$ equals 1. Then we split $S$ into $k$
disjoint subsets in the following manner: there is one set for each value of $\beta_i$ that equals
1, and the set for this $i$ has $2^i$ elements. The assignment of $S$ to sets can be arbitrary,
as long as the resulting sets are disjoint. To get an extraction function, we map the
elements of the subset with $2^i$ elements in a one-to-one manner with the $2^i$ binary strings
of length $i$.

Show that this mapping is equivalent to the recursive extraction procedure given in
Theorem 9.5 in that both produce $i$ bits with the same probability for all $i$.

Exercise 9.9: We have shown that we can extract, on average, at least
$\lfloor \log_2 m \rfloor - 1$ independent, unbiased bits from a number chosen uniformly at
random from $\{0, \ldots, m-1\}$. It follows that if we have k numbers chosen independently
and uniformly at random from $\{0, \ldots, m-1\}$ then we can extract, on average, at least
$k \lfloor \log_2 m \rfloor - k$ independent, unbiased bits from them. Give a better procedure
that extracts, on average, at least $k \lfloor \log_2 m \rfloor - 1$ independent, unbiased
bits from these numbers.

Exercise 9.10: Suppose that we have a means of generating independent, fair coin flips.

(a) Give an algorithm using the coin to generate a number uniformly from
$\{0, 1, \ldots, n-1\}$, where n is a power of 2, using exactly $\log_2 n$ flips.

(b) Argue that, if n is not a power of 2, then no algorithm can generate a number
uniformly from $\{0, 1, \ldots, n-1\}$ using exactly k coin flips for any fixed k.

(c) Argue that, if n is not a power of 2, then no algorithm can generate a number
uniformly from $\{0, 1, \ldots, n-1\}$ using at most k coin flips for any fixed k.

(d) Give an algorithm using the coin to generate a number uniformly from
$\{0, 1, \ldots, n-1\}$, even when n is not a power of 2, using at most
$2\lceil \log_2 n \rceil$ expected flips.

Exercise 9.11: Suppose that we have a means of generating independent, fair coin flips.

(a) Give an algorithm using the fair coin that simulates flipping a biased coin that
comes up heads with probability p. The expected number of flips your algorithm
uses should be at most 2. (Hint: Think of p written as a decimal in binary, and
use the fair coin to generate binary decimal digits.)

(b) Give an algorithm using the coin to generate a number uniformly from
$\{0, 1, \ldots, n-1\}$. The expected number of flips your algorithm uses should be at most
$\lceil \log_2 n \rceil + 2$.


X = HHTTHTHHHTHHHTTTHTTT

Y = HTHHTT        Z = HHTHTHTHTH

Figure 9.3: After running A on the input sequence X, we can derive further sequences Y and Z;
after running A on each of Y and Z, we can derive further sequences from them; and so on.


Exercise 9.12: Here is an extraction algorithm A whose input is a sequence
$X = x_1, x_2, \ldots, x_n$ of n independent flips of a coin that comes up heads with
probability p > 1/2. Break the sequence into $\lfloor n/2 \rfloor$ pairs,
$a_i = (x_{2i-1}, x_{2i})$ for $i = 1, \ldots, \lfloor n/2 \rfloor$.
Consider the pairs in order. If $a_i$ = (heads, tails) then output a 0; if $a_i$ = (tails, heads)
then output a 1; otherwise, move on to the next pair.
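
The algorithm A just described can be written down directly. The sketch below assumes the flips are given as a string of 'H' and 'T' characters; the function name `extract_a` is a hypothetical choice for this illustration.

```python
def extract_a(flips: str) -> list:
    """Run extraction algorithm A on a sequence of 'H'/'T' flips.

    Consecutive flips are grouped into floor(n/2) pairs; (H, T) yields a 0,
    (T, H) yields a 1, and the pairs (H, H) and (T, T) yield nothing.
    """
    bits = []
    for i in range(0, len(flips) - 1, 2):
        pair = flips[i] + flips[i + 1]
        if pair == "HT":
            bits.append(0)
        elif pair == "TH":
            bits.append(1)
        # "HH" and "TT" are skipped; a trailing unpaired flip is ignored.
    return bits
```

On the sequence X shown in Figure 9.3 the pairs are HH, TT, HT, HH, HT, HH, HT, TT, HT, TT, so A outputs four bits, all of them 0.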


(a) Show that the bits extracted are independent and unbiased.

(b) Show that the expected number of extracted bits is
$\lfloor n/2 \rfloor \, 2p(1-p) \approx np(1-p)$.

(c) We can derive another set of flips $Y = y_1, y_2, \ldots$ from the sequence X as follows.
Start with j, k = 1. Repeat the following operations until $j = \lfloor n/2 \rfloor$: If $a_j$ =
(heads, heads), set $y_k$ to heads and increment j and k; if $a_j$ = (tails, tails), set
$y_k$ to tails and increment j and k; otherwise, increment j. See Figure 9.3 for an
example.

The intuition here is that we take some of the randomness that A was unable
to use effectively and re-use it. Show that the bits produced by running A on Y
are independent and unbiased, and further argue that they are independent of those
produced from running A on X.

(d) We can derive a second set of flips $Z = z_1, z_2, \ldots, z_{\lfloor n/2 \rfloor}$ from the
sequence X as follows: let $z_i$ be heads if $a_i$ = (heads, heads) or (tails, tails), and let
$z_i$ be tails otherwise. See Figure 9.3 for an example. Show that the bits produced by running
A on Z are independent and unbiased, and further argue that they are independent
of those produced from running A on X and Y.

(e) After we derive and run A on Y and Z, we can recursively derive two further
sequences from each of these sequences in the same way, run A on those, and so on.
See Figure 9.3 for an example. Let A(p) be the average number of bits extracted
for each flip (with probability p of coming up heads) in the sequence X, in the
limit as the length of the sequence X goes to infinity. Argue that A(p) satisfies the
recurrence

$$A(p) = p(1-p) + \frac{1}{2}\left(p^2 + q^2\right) A\!\left(\frac{p^2}{p^2 + q^2}\right) + \frac{1}{2} A\!\left(p^2 + (1-p)^2\right),$$

where q = 1 - p.

(f) Show that the entropy function H(p) satisfies this recurrence for A(p).

(g) Implement the recursive extraction procedure explained in part (e). Run it 1000
times on sequences of 1024 bits generated by a coin that comes up heads with
probability p = 0.7. Give the distribution of the number of flips extracted over the 1000
runs and discuss how close your results are to 1024 · H(0.7).

Exercise 9.13: Suppose that, instead of a biased coin, we have a biased six-sided die 
with entropy h > 0. Modify our extraction function for the case of biased coins so that 
it extracts, on average, almost h random bits per roll from a sequence of die rolls. Prove 
formally that your extraction function works by modifying Theorem 9.5 appropriately. 

Exercise 9.14: Suppose that, instead of a biased coin, we have a biased six-sided die
with entropy h > 0. Modify our compression function for the case of biased coins so
that it compresses a sequence of n die rolls to almost nh bits on average. Prove
formally that your compression function works by modifying Theorem 9.6 appropriately.

Exercise 9.15: We wish to compress a sequence of independent, identically
distributed random variables $X_1, X_2, \ldots$. Each $X_j$ takes on one of n values. We map
the ith value to a codeword, which is a sequence of $\ell_i$ bits. We wish these codewords to
have the property that no codeword is the prefix of any other codeword.

(a) Explain how this property can be used to easily decompress the string created by
the compression algorithm when reading the bits sequentially.

(b) Prove that the $\ell_i$ must satisfy

$$\sum_{i=1}^{n} 2^{-\ell_i} \le 1.$$

This is known as the Kraft inequality.

Exercise 9.16: We wish to compress a sequence of independent, identically
distributed random variables $X_1, X_2, \ldots$. Each $X_j$ takes on one of n values. The ith
value occurs with probability $p_i$, where $p_1 \ge p_2 \ge \cdots \ge p_n$. The result is
compressed as follows. Set

$$T_i = \sum_{j=1}^{i-1} p_j,$$

and let the ith codeword be the first $\lceil \log_2(1/p_i) \rceil$ bits of $T_i$. Start with
an empty string, and consider the $X_j$ in order. If $X_j$ takes on the ith value, append the
ith codeword to the end of the string.
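
As a concrete illustration of how the codewords are built, the following sketch computes them with exact rational arithmetic. The function name `shannon_code` and the use of `fractions.Fraction` are choices made for this example only; the input is assumed to be sorted in nonincreasing order, as in the exercise.

```python
from fractions import Fraction


def shannon_code(probs):
    """Return the codewords of the scheme above for nonincreasing probabilities."""
    codewords = []
    cumulative = Fraction(0)  # T_i = p_1 + ... + p_{i-1}
    for p in probs:
        # Smallest length L with 2^L >= 1/p, that is, ceil(log2(1/p)).
        length = 0
        while p * (1 << length) < 1:
            length += 1
        # Read off the first `length` bits of the binary expansion of T_i.
        bits, frac = [], cumulative
        for _ in range(length):
            frac *= 2
            bit = int(frac >= 1)
            bits.append(str(bit))
            frac -= bit
        codewords.append("".join(bits))
        cumulative += p
    return codewords
```

For the probabilities 1/2, 1/4, 1/8, 1/8 the cumulative sums are 0, 1/2, 3/4, 7/8 and the codewords come out as 0, 10, 110, 111.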

(a) Show that no codeword is the prefix of any other codeword.

(b) Let z be the average number of bits appended for each random variable $X_j$. Show
that

$$H(X) \le z \le H(X) + 1.$$

Exercise 9.17: Arithmetic coding is a standard compression method. In the case where
the string to be compressed is a sequence of biased coin flips, it can be described as
follows. Suppose that we have a sequence of bits $X = (X_1, X_2, \ldots, X_n)$, where each
$X_i$ is independently 0 with probability p and 1 with probability 1 - p. The sequences
can be ordered lexicographically, so for $x = (x_1, x_2, \ldots, x_n)$ and
$y = (y_1, y_2, \ldots, y_n)$ we say x < y if $x_i = 0$ and $y_i = 1$ in the first coordinate i
such that $x_i \ne y_i$. If $z_x$ is the number of zeroes in the string x, then define
$p(x) = p^{z_x}(1-p)^{n - z_x}$ and

$$q(x) = \sum_{y < x} p(y).$$

(a) Suppose we are given $X = (X_1, X_2, \ldots, X_n)$ sequentially. Explain how to
compute q(X) in time O(n). (You may assume that any operation on real numbers
takes constant time.)

(b) Argue that the intervals [q(x), q(x) + p(x)) are disjoint subintervals of [0,1).

(c) Given (a) and (b), the sequence X can be represented by any point in the
interval [q(X), q(X) + p(X)). Show that we can choose a codeword in [q(X),
q(X) + p(X)) with $\lceil \log_2(1/p(X)) \rceil + 1$ binary decimal digits to represent X in
such a way that no codeword is the prefix of any other codeword.

(d) Given a codeword chosen as in (c), explain how to decompress it to determine the
corresponding sequence $(X_1, X_2, \ldots, X_n)$.

(e) Using a Chernoff bound, argue that $\log_2(1/p(X))$ is close to nH(p) with high
probability. Hence this approach yields an effective compression scheme.

Exercise 9.18: Alice wants to send Bob the result of a fair coin flip over a binary 
symmetric channel that flips each bit with probability p < 1/2. To avoid errors in 
transmission, she encodes heads as a sequence of 2k + 1 zeroes and tails as a sequence 
of 2k + 1 ones. 

(a) Consider the case where k = 1, so heads is encoded as 000 and tails as 111. For
each of the eight possible sequences of 3 bits that can be received, determine the
probability that Alice flipped a heads conditioned on Bob receiving that sequence.

(b) Bob decodes by examining the 3 bits. If two or three of the bits are 0, then Bob
decides the corresponding coin flip was a heads. Prove that this rule minimizes the
probability of error for each flip.

(c) Argue that, for general k, Bob minimizes the probability of error by deciding the
flip was heads if at least k + 1 of the bits are 0.

(d) Give a formula for the probability that Bob makes an error that holds for general
k. Evaluate the formula for p = 0.1 and k ranging from 1 to 6.

(e) Give a bound on the probability computed in part (d) using Chernoff bounds. 


Exercise 9.19: Consider the following channel. The sender can send a symbol from
the set {0,1,2,3,4}. The channel introduces errors; when the symbol k is sent, the
recipient receives k + 1 mod 5 with probability 1/2 and receives k - 1 mod 5 with
probability 1/2. The errors are mutually independent when multiple symbols are sent.

Let us define encoding and decoding functions for this channel. A (j, n) encoding
function Enc maps a number in $\{0, 1, \ldots, j-1\}$ into sequences from $\{0,1,2,3,4\}^n$,
and a (j, n) decoding function Dec maps sequences from $\{0,1,2,3,4\}^n$ back into
$\{0, 1, \ldots, j-1\}$. Notice that this definition is slightly different than the one we used
for bit sequences over the binary symmetric channel.

There are (2, 1) encoding and decoding functions with zero probability of error. The
encoding function maps 0 to 0 and 1 to 1. When a 0 is sent, the receiver will receive
either a 1 or 4, so the decoding function maps 1 and 4 back to 0. When a 1 is sent, the
receiver will receive either a 2 or 0, so the decoding function maps 2 and 0 back to 1.
This guarantees that no error is made. Hence at least one bit can be sent without error
per channel use.

(a) Show that there are (5, 2) encoding and decoding functions with zero probability
of error. Argue that this means more than one bit of information can be sent per
use of the channel.

(b) Show that if there are (j, n) encoding and decoding functions with zero
probability of error, then $n \ge \log_2 j / (\log_2 5 - 1)$.


Exercise 9.20: A binary erasure channel transfers a sequence of n bits. Each bit either
arrives successfully without error or fails to arrive successfully and is replaced by a ‘?’
symbol, denoting that it is not known if that bit is a 0 or a 1. Failures occur
independently with probability p. We can define (k, n) encoding and decoding functions for
the binary erasure channel in a similar manner as for the binary symmetric channel,
except here the decoding function Dec : $\{0, 1, ?\}^n \to \{0, 1\}^k$ must handle sequences
with the ‘?’ symbol.

Prove that, for any p > 0 and any constants $\delta, \gamma > 0$, if n is sufficiently large then
there exist (k, n) encoding and decoding functions with $k \le n(1 - p - \delta)$ such that the
probability that the receiver fails to obtain the correct message is at most $\gamma$ for every
possible k-bit input message.

Exercise 9.21: In proving Shannon’s theorem, we used the following decoding method:
Look for a codeword that differs from the received sequence of bits in between
$(p - \varepsilon)n$ and $(p + \varepsilon)n$ places, for an appropriate choice of $\varepsilon$; if
there is only one such codeword, the decoder concludes that that codeword was the one sent.
Suppose instead that the decoder looks for the codeword that differs from the received
sequence in the smallest number of bits (breaking ties arbitrarily), and concludes that that
codeword was the one sent. Show how to modify the proof of Shannon’s theorem for this
decoding technique to obtain a similar result.


The Monte Carlo Method


The Monte Carlo method refers to a collection of tools for estimating values through
sampling and simulation. Monte Carlo techniques are used extensively in almost all
areas of physical sciences and engineering. In this chapter, we first present the basic
idea of estimating a value through sampling, using a simple experiment that gives an
estimate of the value of the constant π. Estimating through sampling is often more
complex than this simple example suggests. We demonstrate the potential difficulties that
can arise in devising an efficient sampling procedure by considering how to
appropriately sample in order to estimate the number of satisfying assignments of a disjunctive
normal form (DNF) Boolean formula.

We then move to more general considerations, demonstrating a general reduction 
from almost uniform sampling to approximate counting of combinatorial objects. This 
leads us to consider how to obtain almost uniform samples. One method is the Markov 
chain Monte Carlo (MCMC) technique, introduced in the last section of this chapter. 


10.1. The Monte Carlo Method 


Consider the following approach for estimating the value of the constant π. Let (X, Y)
be a point chosen uniformly at random in a 2 × 2 square centered at the origin (0,0).
This is equivalent to choosing X and Y independently from a uniform distribution on
[-1,1]. The circle of radius 1 centered at (0,0) lies inside this square and has area π.
If we let

$$Z = \begin{cases} 1 & \text{if } \sqrt{X^2 + Y^2} \le 1, \\ 0 & \text{otherwise,} \end{cases}$$

then, because the point was chosen uniformly from the 2 × 2 square, the probability
that Z = 1 is exactly the ratio of the area of the circle to the area of the square. See
Figure 10.1. Hence

$$\Pr(Z = 1) = \frac{\pi}{4}.$$

Figure 10.1: A point chosen uniformly at random in the
square has probability π/4 of landing in the circle.

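Assume that we run this experiment m times (with X and Y chosen independently
among the runs), with $Z_i$ being the value of Z at the ith run. If $W = \sum_{i=1}^{m} Z_i$, then

$$\mathbf{E}[W] = \mathbf{E}\left[\sum_{i=1}^{m} Z_i\right] = \sum_{i=1}^{m} \mathbf{E}[Z_i] = \frac{m\pi}{4},$$

and hence W' = (4/m)W is a natural estimate for π. Applying the Chernoff bound of
Eqn. (4.6), we compute

$$\Pr(|W' - \pi| \ge \varepsilon\pi) = \Pr\left(\left|W - \frac{m\pi}{4}\right| \ge \frac{\varepsilon m \pi}{4}\right)
= \Pr(|W - \mathbf{E}[W]| \ge \varepsilon \mathbf{E}[W])
\le 2e^{-m\pi\varepsilon^2/12}.$$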

Therefore, by using a sufficiently large number of samples we can obtain, with high
probability, as tight an approximation of π as we wish.

This method for approximating π is an example of a more general class of
approximation algorithms that we now characterize.

Definition 10.1: A randomized algorithm gives an $(\varepsilon, \delta)$-approximation for the
value V if the output X of the algorithm satisfies

$$\Pr(|X - V| \le \varepsilon V) \ge 1 - \delta.$$

Our method for estimating π gives an $(\varepsilon, \delta)$-approximation, as long as
$\varepsilon < 1$ and we choose m large enough to make

$$2e^{-m\pi\varepsilon^2/12} \le \delta.$$

Algebraic manipulation yields that choosing

$$m \ge \frac{12 \ln(2/\delta)}{\pi \varepsilon^2}$$

is sufficient.
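
The experiment and the sample bound can be tried out with a short Python sketch. The function names are hypothetical, and `math.pi` is used inside the sample-size formula purely for illustration (in practice any lower bound on π, such as 3, would serve there).

```python
import math
import random


def pi_sample_size(eps: float, delta: float) -> int:
    """Smallest integer m with m >= 12 ln(2/delta) / (pi eps^2)."""
    return math.ceil(12 * math.log(2 / delta) / (math.pi * eps ** 2))


def estimate_pi(m: int, rng: random.Random) -> float:
    """Return W' = (4/m) W, where W counts the sampled points inside the circle."""
    w = 0
    for _ in range(m):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # Z = 1 when the point lies in the unit circle
            w += 1
    return 4.0 * w / m


if __name__ == "__main__":
    m = pi_sample_size(eps=0.01, delta=0.05)
    print(m, estimate_pi(m, random.Random(1)))
```

Because the Chernoff bound is not tight, the observed error is typically far smaller than the guaranteed επ.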

We may generalize the idea behind our technique for estimating π to provide a
relation between the number of samples and the quality of the approximation. We use the
following simple application of the Chernoff bound throughout this chapter.

Theorem 10.1: Let $X_1, \ldots, X_m$ be independent and identically distributed indicator
random variables, with $\mu = \mathbf{E}[X_i]$. If $m \ge (3 \ln(2/\delta))/\varepsilon^2 \mu$, then

$$\Pr\left(\left|\frac{1}{m}\sum_{i=1}^{m} X_i - \mu\right| \ge \varepsilon\mu\right) \le \delta.$$

That is, m samples provide an $(\varepsilon, \delta)$-approximation for $\mu$.

The proof is left as Exercise 10.1. 
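
To get a feel for the bound in Theorem 10.1, the sample size can be computed directly. The helper below is a minimal sketch with a hypothetical name; it simply evaluates the expression in the theorem and rounds up.

```python
import math


def samples_needed(eps: float, delta: float, mu: float) -> int:
    """Smallest integer m with m >= 3 ln(2/delta) / (eps^2 mu), as in Theorem 10.1."""
    if not (eps > 0 and 0 < delta < 1 and 0 < mu <= 1):
        raise ValueError("need eps > 0, 0 < delta < 1, and 0 < mu <= 1")
    return math.ceil(3 * math.log(2 / delta) / (eps ** 2 * mu))


if __name__ == "__main__":
    # The required number of samples grows like 1/mu as the event gets rarer.
    for mu in (0.5, 0.05, 0.005):
        print(mu, samples_needed(eps=0.1, delta=0.05, mu=mu))
```

With μ = π/4 the expression reduces to the bound 12 ln(2/δ)/(πε²) derived for the π experiment. The 1/μ dependence is what makes sampling expensive when the event being counted is rare.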

More generally, we will want an algorithm that approximates not just a single value
but instead takes as input a problem instance and approximates the solution value for
that problem. Here we are considering problems that map inputs x to values V(x). For
example, given an input graph, we might want to know an approximation to the
number of independent sets in the graph.

You might ask why we should settle for an approximation; perhaps we should aim
for an exact answer. In the case of π, we cannot obtain an exact answer because it is
an irrational number. Another reason for seeking an approximation is that, as we shall
see shortly, there are problems for which the existence of an algorithm that gives an
exact answer would imply that P = NP, and hence it is unlikely that such an algorithm
will be found. This, however, does not preclude the possibility of an efficient
approximation algorithm.

Definition 10.2: A fully polynomial randomized approximation scheme (FPRAS) for
a problem is a randomized algorithm for which, given an input x and any parameters
$\varepsilon$ and $\delta$ with $0 < \varepsilon, \delta < 1$, the algorithm outputs an
$(\varepsilon, \delta)$-approximation to V(x) in time that is polynomial in $1/\varepsilon$,
$\ln \delta^{-1}$, and the size of the input x.

Exercise 10.3 considers a seemingly weaker but actually equivalent definition of an
FPRAS that avoids the parameter $\delta$.

The Monte Carlo method essentially consists of the approach we have outlined here
to obtain an efficient approximation for a value V. We require an efficient process
that generates a sequence of independent and identically distributed random samples
$X_1, X_2, \ldots, X_n$ such that $\mathbf{E}[X_i] = V$. We then take enough samples to get an
$(\varepsilon, \delta)$-approximation to V. Generating a good sequence of samples is often a
nontrivial task and is a major focus of the Monte Carlo method.

The Monte Carlo method is also sometimes called Monte Carlo simulation. As an
example, suppose we want to estimate the expected price of a stock sometime in the
future. We may develop a model where the price $p(Y_1, \ldots, Y_k)$ of the stock at that time
depends on random variables $Y_1, Y_2, \ldots, Y_k$. If we can repeatedly generate
independent random vectors $(y_1, y_2, \ldots, y_k)$ from the joint distribution of the $Y_i$, then
we can repeatedly generate independent random variables $X_1, X_2, \ldots$, where

$$X_i = p(Y_1, \ldots, Y_k).$$

We can then use the $X_i$ to estimate the expected future price
$\mathbf{E}[p(Y_1, \ldots, Y_k)]$ with the Monte Carlo method. That is, by simulating the possible
future outcomes of the $Y_i$ many times, we can estimate the desired expectation.

10.2. Application: The DNF Counting Problem

As an example of an estimation problem that requires a nontrivial sampling technique,
we consider the problem of counting the number of satisfying assignments of a Boolean
formula in disjunctive normal form (DNF). A DNF formula is a disjunction (OR) of
clauses $C_1 \vee C_2 \vee \cdots \vee C_t$, where each clause is a conjunction (AND) of literals.
For example, the following is a DNF formula:

$$(x_1 \wedge \bar{x}_2 \wedge x_3) \vee (x_2 \wedge x_4) \vee (\bar{x}_1 \wedge x_3 \wedge x_4).$$

Recall from Section 6.2.2 that, in a standard satisfiability problem, the input formula
is a conjunction (AND) of a set of clauses, and each clause is the disjunction (OR) of
literals. This is commonly called conjunctive normal form (CNF). While determining
the satisfiability of a formula in CNF form is difficult, determining the satisfiability of
a formula in DNF form is simple. Since a satisfying assignment for a DNF formula
needs to satisfy only one clause, it is easy to find a satisfying assignment or prove that
it is not satisfiable.

How hard is it to exactly count the number of satisfying assignments of a DNF
formula? Given any CNF formula H, we can apply de Morgan's laws to obtain a DNF
formula for $\bar{H}$, the negation of the formula H, with the same number of variables and
clauses as the original CNF formula. The formula H has a satisfying assignment if
and only if there is some assignment for the variables that does not satisfy $\bar{H}$. Thus, H
has a satisfying assignment if and only if the number of satisfying assignments of $\bar{H}$ is
strictly less than $2^n$, the total number of possible assignments for n Boolean variables.
We conclude that counting the number of satisfying assignments of a DNF formula is
at least as hard as solving the NP-complete problem SAT.

There is a complexity class associated with the problem of counting solutions to
problems in NP, denoted by #P and pronounced “sharp-P”. Formally, a problem is in
the class #P if there is a polynomial time, nondeterministic Turing machine such that,
for any input I, the number of accepting computations equals the number of different
solutions associated with the input I. Counting the number of satisfying assignments
of a DNF formula is actually #P-complete; that is, this problem is as hard as any other
problem in this class. Other complete problems for the class #P include counting the
number of Hamiltonian cycles in a graph and counting the number of perfect
matchings in a bipartite graph.

It is unlikely that there is a polynomial time algorithm that computes the exact
number of solutions of a #P-complete problem, as at the very least such an algorithm would
imply that P = NP. It is therefore interesting to find an FPRAS for the number of
satisfying assignments of a DNF formula.

10.2.1. The Naive Approach

We start by trying to generalize the approach that we used to approximate π, and we
demonstrate why it is unsuitable in general. We then show how to improve our
sampling technique in order to solve the problem.

DNF Counting Algorithm I:

Input: A DNF formula F with n variables.

Output: Y = an approximation of c(F).

1. X ← 0.

2. For k = 1 to m, do:

   (a) Generate a random assignment for the n variables, chosen uniformly at
       random from all $2^n$ possible assignments.

   (b) If the random assignment satisfies F, then X ← X + 1.

3. Return Y ← $(X/m)2^n$.


Algorithm 10.1: DNF counting algorithm I.


Let c(F) be the number of satisfying assignments of a DNF formula F. Here we
assume that c(F) > 0, since it is easy to check whether c(F) = 0 before running
our sampling algorithm. In Section 10.1 we approximated π by generating points
uniformly at random from the 2 × 2 square and checking to see if they were in the target:
a circle of radius 1. We try a similar approach in Algorithm 10.1: we generate
assignments uniformly at random for the n variables and then see if the resulting assignment
is in the target of satisfying assignments for F.
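
A direct transcription of Algorithm 10.1 into Python might look as follows. The clause encoding is an assumption made for this sketch: each clause is a list of nonzero integers, where v stands for the literal x_v and -v for its negation. The brute-force `exact_count` is included only to check the estimate on tiny formulas.

```python
import itertools
import random


def satisfies(formula, assignment):
    """True if some clause has all of its literals true under the assignment."""
    return any(all(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in formula)


def dnf_count_naive(formula, n, m, rng):
    """Algorithm 10.1: sample m uniform assignments and return (X/m) * 2^n."""
    x = 0
    for _ in range(m):
        # Index 0 is unused so that variable v is stored at position v.
        assignment = [None] + [rng.random() < 0.5 for _ in range(n)]
        if satisfies(formula, assignment):
            x += 1
    return (x / m) * 2 ** n


def exact_count(formula, n):
    """Brute-force c(F); exponential in n, for checking small cases only."""
    return sum(satisfies(formula, [None] + list(values))
               for values in itertools.product([False, True], repeat=n))


if __name__ == "__main__":
    # (x1 AND NOT x2 AND x3) OR (x2 AND x4) OR (NOT x1 AND x3 AND x4)
    f = [[1, -2, 3], [2, 4], [-1, 3, 4]]
    print(exact_count(f, 4), dnf_count_naive(f, 4, 20000, random.Random(0)))
```

On this small formula a large fraction of all assignments is satisfying, so the naive estimator behaves well; the discussion that follows explains why that is not true in general.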

Let $X_k$ be 1 if the kth iteration in the algorithm generated a satisfying assignment and
0 otherwise. Then $X = \sum_{k=1}^{m} X_k$, where the $X_k$ are independent 0-1 random variables
that each take the value 1 with probability $c(F)/2^n$. Hence, by linearity of expectations,

$$\mathbf{E}[Y] = \frac{\mathbf{E}[X]\,2^n}{m} = c(F).$$

Applying Theorem 10.1, we see that X/m gives an $(\varepsilon, \delta)$-approximation of
$c(F)/2^n$, and hence that Y gives an $(\varepsilon, \delta)$-approximation of c(F), when

$$m \ge \frac{3 \cdot 2^n \ln(2/\delta)}{\varepsilon^2 c(F)}.$$

If $c(F) \ge 2^n/\alpha(n)$ for some polynomial $\alpha$, then the foregoing analysis tells us we
only need a number of samples m that is polynomial in n, $1/\varepsilon$, and $\ln(1/\delta)$. We
cannot, however, exclude the possibility that c(F) is much less than $2^n$. In particular, c(F)
might be polynomial in n. Since our analysis requires a number of samples m that is
proportional to $2^n/c(F)$, our analysis does not yield that the run time of the algorithm
is always polynomial in the problem size.

This is not simply an artifact of the analysis. We provide a rough sketch of an
argument that is elaborated in Exercise 10.4. If the number of satisfying assignments is
polynomial in n and if at each step we sample uniformly at random from all $2^n$
possible assignments, then with high probability we must sample an exponential number
of assignments before finding the first satisfying assignment. We can conclude, for
example, that we cannot distinguish between instances with n, $n^2$, and $n^3$ satisfying
assignments without considering exponentially many random assignments, since with
high probability we would obtain zero satisfying assignments in all three cases.

The problem with this sampling approach is that the set of satisfying assignments
might not be sufficiently dense in the set of all assignments. This is an additional
requirement of our sampling technique that was not explicit before. In the phrasing of
Theorem 10.1, the value $\mu$ that we are attempting to approximate needs to be
sufficiently large that sampling is efficient.

To obtain an FPRAS for this problem, we need to devise a better sampling scheme
that avoids wasting so many steps on assignments that do not satisfy the formula. We
need to construct a sample space that includes all the satisfying assignments of F and,
moreover, has the property that these assignments are sufficiently dense in the sample
space to allow for efficient sampling.


10.2.2. A Fully Polynomial Randomized Scheme for DNF Counting 


We now revise our sampling procedure to obtain an FPRAS. Let
$F = C_1 \vee C_2 \vee \cdots \vee C_t$, and assume without loss of generality that no clause
includes a variable and its negation. (If there is such a clause, it is not satisfiable and we
can eliminate it from the formula.) A satisfying assignment of F needs to satisfy at least one
of the clauses $C_1, \ldots, C_t$. Each clause is a conjunction of literals, so there is only one
assignment of the variables appearing in the clause that satisfies the clause. All other
variables can have arbitrary values. For example, for the clause
$(x_1 \wedge \bar{x}_2 \wedge x_3)$ to be satisfied, $x_1$ and $x_3$ must be set to True and
$x_2$ must be set to False.

It follows that if clause $C_i$ has $\ell_i$ literals then there are exactly $2^{n-\ell_i}$
satisfying assignments for $C_i$. Let $SC_i$ denote the set of assignments that satisfy clause i,
and let

$$U = \{(i, a) \mid 1 \le i \le t \text{ and } a \in SC_i\}.$$

Notice that we know the size of U, since

$$\sum_{i=1}^{t} |SC_i| = |U|,$$

and we can compute $|SC_i|$.

The value that we want to estimate is given by

$$c(F) = \left|\bigcup_{i=1}^{t} SC_i\right|.$$

Here $c(F) \le |U|$, since an assignment can satisfy more than one clause and thus
appear in more than one pair in U.

To estimate c(F), we define a subset S of U with size c(F). We construct this set
by selecting, for each satisfying assignment of F, exactly one pair in U that has this
assignment; specifically, we can use the pair with the smallest clause index number,
giving

$$S = \{(i, a) \mid 1 \le i \le t,\ a \in SC_i,\ a \notin SC_j \text{ for } j < i\}.$$

DNF Counting Algorithm II:

Input: A DNF formula F with n variables.

Output: Y = an approximation of c(F).

1. X ← 0.

2. For k = 1 to m, do:

   (a) With probability $|SC_i| / \sum_{j=1}^{t} |SC_j|$ choose, uniformly at random, an
       assignment $a \in SC_i$.

   (b) If a is not in any $SC_j$, j < i, then X ← X + 1.

3. Return Y ← $(X/m) \sum_{i=1}^{t} |SC_i|$.


Algorithm 10.2: DNF counting algorithm II.

Since we know the size of U, we can estimate the size of S by estimating the ratio
|S|/|U|. We can estimate this ratio efficiently if we sample uniformly at random from
U using our previous approach, choosing pairs uniformly at random from U and
counting how often they are in S. We can avoid the problem we encountered when simply
sampling assignments at random, because S is relatively dense in U. Specifically, since
each assignment can satisfy at most t different clauses, $|S|/|U| \ge 1/t$.

The only question left is how to sample uniformly from U. Suppose that we first
choose the first coordinate, i. Because the ith clause has $|SC_i|$ satisfying assignments,
we should choose i with probability proportional to $|SC_i|$. Specifically, we should
choose i with probability

$$\frac{|SC_i|}{\sum_{j=1}^{t} |SC_j|} = \frac{|SC_i|}{|U|}.$$

We then can choose a satisfying assignment uniformly at random from $SC_i$. This is
easy to do; we choose the value True or False independently and uniformly at random
for each variable not appearing in clause i. Then the probability that we choose the pair
(i, a) is

$$\Pr((i, a) \text{ is chosen}) = \Pr(i \text{ is chosen}) \cdot \Pr(a \text{ is chosen} \mid i \text{ is chosen})
= \frac{|SC_i|}{|U|} \cdot \frac{1}{|SC_i|}
= \frac{1}{|U|},$$

giving a uniform distribution.

These observations are implemented in Algorithm 10.2. 
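
As an illustration, Algorithm 10.2 can be sketched in Python as below. The clause encoding (lists of nonzero integers, with -v denoting the negation of x_v) and the function names are assumptions of this sketch; as in the text, no clause may contain a variable together with its negation.

```python
import math
import random


def clause_satisfied(clause, assignment):
    """True if every literal of the clause is true under the assignment."""
    return all(assignment[abs(lit)] == (lit > 0) for lit in clause)


def dnf_samples(t, eps, delta):
    """Sample size m = ceil((3t / eps^2) ln(2/delta)) from Theorem 10.2."""
    return math.ceil((3 * t / eps ** 2) * math.log(2 / delta))


def dnf_count_ii(formula, n, m, rng):
    """Algorithm 10.2: estimate c(F) by sampling pairs (i, a) uniformly from U."""
    # |SC_i| = 2^(n - l_i), where l_i is the number of variables in clause i.
    sizes = [2 ** (n - len({abs(lit) for lit in clause})) for clause in formula]
    total = sum(sizes)  # |U|
    x = 0
    for _ in range(m):
        # Step 2(a): clause i with probability |SC_i| / |U|, then a uniform in SC_i.
        i = rng.choices(range(len(formula)), weights=sizes)[0]
        a = [None] + [rng.random() < 0.5 for _ in range(n)]
        for lit in formula[i]:
            a[abs(lit)] = lit > 0  # variables of clause i are forced
        # Step 2(b): count the pair only if no earlier clause is satisfied by a.
        if not any(clause_satisfied(formula[j], a) for j in range(i)):
            x += 1
    return (x / m) * total


if __name__ == "__main__":
    f = [[1, -2, 3], [2, 4], [-1, 3, 4]]  # the example formula; c(F) = 7
    m = dnf_samples(len(f), eps=0.1, delta=0.01)
    print(m, dnf_count_ii(f, 4, m, random.Random(0)))
```

For the example formula |U| = 2 + 4 + 2 = 8 and c(F) = 7, so a sampled pair lands in S with probability 7/8, well above the worst-case guarantee of 1/t.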

Theorem 10.2: DNF counting algorithm II is a fully polynomial randomized
approximation scheme (FPRAS) for the DNF counting problem when
$m = \lceil (3t/\varepsilon^2) \ln(2/\delta) \rceil$.


Proof: Step 2(a) of the algorithm chooses an element of U uniformly at random. The
probability that this element belongs to S is at least 1/t. Fix any $\varepsilon > 0$ and
$\delta > 0$, and let

$$m = \left\lceil \frac{3t}{\varepsilon^2} \ln\frac{2}{\delta} \right\rceil.$$

Then m is polynomial in t, $1/\varepsilon$, and $\ln(1/\delta)$, and the processing time of each
sample is polynomial in t. By Theorem 10.1, with this number of samples, X/m gives an
$(\varepsilon, \delta)$-approximation of c(F)/|U| and hence Y gives an
$(\varepsilon, \delta)$-approximation of c(F). ■


10.3. From Approximate Sampling to Approximate Counting 


The example of DNF formulas demonstrates that there is a fundamental connection
between being able to sample from an appropriate space and being able to count the
number of objects with some property in that space. In this section we present the
outline of a general reduction that shows that, if you can sample almost uniformly a
solution to a “self-reducible” combinatorial problem, then you can construct a
randomized algorithm that approximately counts the number of solutions to that problem. We
demonstrate this technique for the problem of counting the number of independent sets
in a graph. In the next chapter, we also consider the problem of counting the number
of proper colorings in a graph, applying this technique there as well.

We first need to formulate the concept of approximate uniform sampling. In this
setting we are given a problem instance in the form of an input x, and there is an
underlying finite sample space $\Omega(x)$ associated with the input.

Definition 10.3: Let w be the (random) output of a sampling algorithm for a finite
sample space $\Omega$. The sampling algorithm generates an $\varepsilon$-uniform sample of
$\Omega$ if, for any subset S of $\Omega$,

$$\left|\Pr(w \in S) - \frac{|S|}{|\Omega|}\right| \le \varepsilon.$$

A sampling algorithm is a fully polynomial almost uniform sampler (FPAUS) for a
problem if, given an input x and a parameter $\varepsilon > 0$, it generates an
$\varepsilon$-uniform sample of $\Omega(x)$ and runs in time that is polynomial in
$\ln \varepsilon^{-1}$ and the size of the input x.

In the next chapter, we introduce the notion of total variation distance, which allows
for an equivalent definition of an $\varepsilon$-uniform sample.

As an example, an FPAUS for independent sets would take as input a graph G =
(V, E) and a parameter $\varepsilon$. The sample space would be all independent sets in the graph.
The output would be an $\varepsilon$-uniform sample of the independent sets, and the time to
produce such a sample would be polynomial in the size of the graph and $\ln \varepsilon^{-1}$.
In fact, in the reduction that follows we only need the running time to be polynomial in
$\varepsilon^{-1}$, but we use the standard definition given in Definition 10.3.

Our goal is to show that, given an FPAUS for independent sets, we can construct an
FPRAS for counting the number of independent sets. Assume that the input G has m
edges, and let $e_1, \ldots, e_m$ be an arbitrary ordering of the edges. Let $E_i$ be the set of
the first i edges in E and let $G_i = (V, E_i)$. Note that $G = G_m$ and that $G_{i-1}$ is
obtained from $G_i$ by removing a single edge.

We let $\Omega(G_i)$ denote the set of independent sets in $G_i$. The number of independent
sets in G can then be expressed as

$$|\Omega(G_m)| = \frac{|\Omega(G_m)|}{|\Omega(G_{m-1})|} \times \frac{|\Omega(G_{m-1})|}{|\Omega(G_{m-2})|}
\times \frac{|\Omega(G_{m-2})|}{|\Omega(G_{m-3})|} \times \cdots \times \frac{|\Omega(G_1)|}{|\Omega(G_0)|}
\times |\Omega(G_0)|.$$

Since $G_0$ has no edges, every subset of V is an independent set and $|\Omega(G_0)| = 2^n$.
In order to estimate $|\Omega(G)|$, we need good estimates for the ratios

$$r_i = \frac{|\Omega(G_i)|}{|\Omega(G_{i-1})|}, \qquad i = 1, \ldots, m.$$

More formally, we will develop estimates $\tilde{r}_i$ for the ratios $r_i$, and then our estimate
for the number of independent sets in G will be

$$2^n \prod_{i=1}^{m} \tilde{r}_i,$$

while the true number is

$$|\Omega(G)| = 2^n \prod_{i=1}^{m} r_i.$$

To evaluate the error in our estimate, we need to bound the ratio

$$R = \prod_{i=1}^{m} \frac{\tilde{r}_i}{r_i}.$$

Specifically, to have an $(\varepsilon, \delta)$-approximation, we want
$\Pr(|R - 1| \le \varepsilon) \ge 1 - \delta$. We will make use of the following lemma.


Lemma 10.3: Suppose that for all i, \ < i < m, r t is an (e/2m, 8/m)-approximation 
for r,. Then 


Pr(\R - 1| <£)>\-S. 


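Proof: For each $1 \le i \le m$, we have

\[ \Pr\left( |\tilde{r}_i - r_i| \le \frac{\varepsilon}{2m} r_i \right) \ge 1 - \frac{\delta}{m}. \]

Equivalently,

\[ \Pr\left( |\tilde{r}_i - r_i| > \frac{\varepsilon}{2m} r_i \right) < \frac{\delta}{m}. \]

By the union bound, the probability that $|\tilde{r}_i - r_i| > (\varepsilon/2m) r_i$ for any $i$ is at most $\delta$;
therefore, $|\tilde{r}_i - r_i| \le (\varepsilon/2m) r_i$ for all $i$ with probability at least $1 - \delta$. Equivalently,

\[ 1 - \frac{\varepsilon}{2m} \le \frac{\tilde{r}_i}{r_i} \le 1 + \frac{\varepsilon}{2m} \]

holds for all $i$ with probability at least $1 - \delta$. When these bounds hold for all $i$, we can
combine them to obtain

\[ 1 - \varepsilon \le \left(1 - \frac{\varepsilon}{2m}\right)^m \le \prod_{i=1}^{m} \frac{\tilde{r}_i}{r_i} \le \left(1 + \frac{\varepsilon}{2m}\right)^m \le 1 + \varepsilon, \]

giving the lemma. ■

Estimating $r_i$:

Input: Graphs $G_{i-1} = (V, E_{i-1})$ and $G_i = (V, E_i)$.

Output: $\tilde{r}_i$ = an approximation of $r_i$.

1. $X \leftarrow 0$.

2. Repeat for $M = \lceil 1296 m^2 \varepsilon^{-2} \ln(2m/\delta) \rceil$ independent trials:

(a) Generate an $(\varepsilon/6m)$-uniform sample from $\Omega(G_{i-1})$.

(b) If the sample is an independent set in $G_i$, let $X \leftarrow X + 1$.

3. Return $\tilde{r}_i \leftarrow X/M$.

Algorithm 10.3: Estimating $r_i$.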
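
For completeness, the two outer inequalities used at the end of the proof of Lemma 10.3 can be checked as follows. This is a short worked step, not part of the original proof, and it assumes $0 < \varepsilon \le 1$ (the same range used in Lemma 10.4).

```latex
\[
\left(1 + \frac{\varepsilon}{2m}\right)^{m} \le e^{\varepsilon/2} \le 1 + \varepsilon,
\qquad
\left(1 - \frac{\varepsilon}{2m}\right)^{m} \ge 1 - \frac{\varepsilon}{2} \ge 1 - \varepsilon .
\]
% Left chain: 1 + x <= e^x, and e^x <= 1 + 2x for 0 <= x <= 1/2
% (e^x is convex, equals 1 at x = 0, and e^{1/2} < 2 at x = 1/2).
% Right chain: Bernoulli's inequality (1 - x)^m >= 1 - m x for 0 <= x <= 1.
```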

Hence all we need is a method for obtaining an $(\varepsilon/2m, \delta/m)$-approximation for the $r_i$.
We estimate each of these ratios by a Monte Carlo algorithm that uses the FPAUS for
sampling independent sets. To estimate $r_i$, we sample independent sets in $G_{i-1}$ and
compute the fraction of these sets that are also independent sets in $G_i$, as described in
Algorithm 10.3. The constants in the procedure were chosen to facilitate the proof of
Lemma 10.4.
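
The reduction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an exact uniform sampler obtained by brute-force enumeration stands in for the FPAUS (a real FPAUS would not enumerate the sample space), and the helper names `independent_sets`, `estimate_ratio`, and `estimate_count` are hypothetical.

```python
import math
import random
from itertools import combinations

def independent_sets(n, edges):
    """Enumerate all independent sets of the graph on vertices 0..n-1."""
    result = []
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            s = set(subset)
            if all(not (u in s and v in s) for u, v in edges):
                result.append(frozenset(s))
    return result

def estimate_ratio(n, edges_prev, new_edge, eps, delta, m, rng):
    """Algorithm 10.3, with an exact uniform sampler standing in for the FPAUS."""
    trials = math.ceil(1296 * m**2 * eps**-2 * math.log(2 * m / delta))
    omega_prev = independent_sets(n, edges_prev)  # support of the stand-in sampler
    u, v = new_edge
    hits = 0
    for _ in range(trials):
        sample = rng.choice(omega_prev)
        # A set independent in G_{i-1} is independent in G_i unless it
        # contains both endpoints of the added edge.
        if not (u in sample and v in sample):
            hits += 1
    return hits / trials

def estimate_count(n, edges, eps, delta, seed=0):
    """Estimate |Omega(G)| as 2^n times the product of the estimated ratios."""
    rng = random.Random(seed)
    m = len(edges)
    estimate = 2.0 ** n
    for i in range(m):
        estimate *= estimate_ratio(n, edges[:i], edges[i], eps, delta, m, rng)
    return estimate

path_edges = [(0, 1), (1, 2)]  # path on 3 vertices, which has 5 independent sets
approx = estimate_count(3, path_edges, eps=0.5, delta=0.25)
exact = len(independent_sets(3, path_edges))
print(exact, approx)
```

On this path the true ratios are $r_1 = 6/8$ and $r_2 = 5/6$, so $2^3 r_1 r_2 = 5$; the estimate should land well within the requested relative error of that value.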


Lemma 10.4: When $m \ge 1$ and $0 < \varepsilon \le 1$, the procedure for estimating $r_i$ yields an
$(\varepsilon/2m, \delta/m)$-approximation for $r_i$.


Proof: We first show that $r_i$ is not too small, avoiding the problem that we found in
Section 10.2.1. Suppose that $G_{i-1}$ and $G_i$ differ in that edge $\{u, v\}$ is present in $G_i$ but
not in $G_{i-1}$. An independent set in $G_i$ is also an independent set in $G_{i-1}$, so

\[ \Omega(G_i) \subseteq \Omega(G_{i-1}). \]

An independent set in $\Omega(G_{i-1}) \setminus \Omega(G_i)$ contains both $u$ and $v$. To bound the size of
the set $\Omega(G_{i-1}) \setminus \Omega(G_i)$, we associate each $I \in \Omega(G_{i-1}) \setminus \Omega(G_i)$ with an independent
set $I \setminus \{v\} \in \Omega(G_i)$. In this mapping an independent set $I' \in \Omega(G_i)$ is associated
with no more than one independent set $I' \cup \{v\} \in \Omega(G_{i-1}) \setminus \Omega(G_i)$, and thus
$|\Omega(G_{i-1}) \setminus \Omega(G_i)| \le |\Omega(G_i)|$. It follows that

\[ r_i = \frac{|\Omega(G_i)|}{|\Omega(G_{i-1})|} = \frac{|\Omega(G_i)|}{|\Omega(G_i)| + |\Omega(G_{i-1}) \setminus \Omega(G_i)|} \ge \frac{1}{2}. \]

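Now consider our $M$ samples, and let $X_k = 1$ if the $k$th sample is in $\Omega(G_i)$ and
0 otherwise. Because our samples are generated by an $(\varepsilon/6m)$-uniform sampler, by
Definition 10.3 each $X_k$ must satisfy

\[ \left| \Pr(X_k = 1) - \frac{|\Omega(G_i)|}{|\Omega(G_{i-1})|} \right| \le \frac{\varepsilon}{6m}. \]

Since the $X_k$ are indicator random variables, it follows that

\[ \left| \mathbf{E}[X_k] - \frac{|\Omega(G_i)|}{|\Omega(G_{i-1})|} \right| \le \frac{\varepsilon}{6m} \]

and further, by linearity of expectations,

\[ \left| \mathbf{E}\left[ \frac{\sum_{k=1}^{M} X_k}{M} \right] - \frac{|\Omega(G_i)|}{|\Omega(G_{i-1})|} \right| \le \frac{\varepsilon}{6m}. \]

We therefore have

\[ |\mathbf{E}[\tilde{r}_i] - r_i| = \left| \mathbf{E}\left[ \frac{\sum_{k=1}^{M} X_k}{M} \right] - \frac{|\Omega(G_i)|}{|\Omega(G_{i-1})|} \right| \le \frac{\varepsilon}{6m}. \]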

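We now complete the lemma by combining (a) the fact just shown that $\mathbf{E}[\tilde{r}_i]$ is close
to $r_i$ and (b) the fact that $\tilde{r}_i$ will be close to $\mathbf{E}[\tilde{r}_i]$ for a sufficiently large number of
samples. Using $r_i \ge 1/2$, we have

\[ \mathbf{E}[\tilde{r}_i] \ge r_i - \frac{\varepsilon}{6m} \ge \frac{1}{2} - \frac{1}{6} = \frac{1}{3}. \]

Applying Theorem 10.1 yields that, if the number of samples $M$ satisfies

\[ M \ge \frac{3 \ln(2m/\delta)}{(\varepsilon/12m)^2 (1/3)} = 1296 m^2 \varepsilon^{-2} \ln \frac{2m}{\delta}, \]

then

\[ \Pr\left( \left| \frac{\tilde{r}_i}{\mathbf{E}[\tilde{r}_i]} - 1 \right| \ge \frac{\varepsilon}{12m} \right) = \Pr\left( |\tilde{r}_i - \mathbf{E}[\tilde{r}_i]| \ge \frac{\varepsilon}{12m} \mathbf{E}[\tilde{r}_i] \right) \le \frac{\delta}{m}. \]

Equivalently, with probability $1 - \delta/m$,

\[ 1 - \frac{\varepsilon}{12m} \le \frac{\tilde{r}_i}{\mathbf{E}[\tilde{r}_i]} \le 1 + \frac{\varepsilon}{12m}. \tag{10.1} \]

As $|\mathbf{E}[\tilde{r}_i] - r_i| \le \varepsilon/6m$, we have that

\[ 1 - \frac{\varepsilon}{6m r_i} \le \frac{\mathbf{E}[\tilde{r}_i]}{r_i} \le 1 + \frac{\varepsilon}{6m r_i}. \]

Using that $r_i \ge 1/2$ then yields

\[ 1 - \frac{\varepsilon}{3m} \le \frac{\mathbf{E}[\tilde{r}_i]}{r_i} \le 1 + \frac{\varepsilon}{3m}. \tag{10.2} \]

Combining equations (10.1) and (10.2), it follows that, with probability $1 - \delta/m$,

\[ 1 - \frac{\varepsilon}{2m} \le \left(1 - \frac{\varepsilon}{3m}\right)\left(1 - \frac{\varepsilon}{12m}\right) \le \frac{\tilde{r}_i}{r_i} \le \left(1 + \frac{\varepsilon}{3m}\right)\left(1 + \frac{\varepsilon}{12m}\right) \le 1 + \frac{\varepsilon}{2m}. \]

This gives the desired $(\varepsilon/2m, \delta/m)$-approximation. ■

The number of samples $M$ is polynomial in $m$, $\varepsilon^{-1}$, and $\ln \delta^{-1}$, and the time for each
sample is polynomial in the size of the graph and $\ln \varepsilon^{-1}$. We therefore have the following
theorem.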

Theorem 10.5: Given a fully polynomial almost uniform sampler (FPAUS) for
independent sets in any graph, we can construct a fully polynomial randomized
approximation scheme (FPRAS) for the number of independent sets in a graph $G$.

In fact, this theorem is more often used in the following form.

Theorem 10.6: Given a fully polynomial almost uniform sampler (FPAUS) for
independent sets in any graph with maximum degree at most $\Delta$, we can construct a fully
polynomial randomized approximation scheme (FPRAS) for the number of independent
sets in a graph $G$ with maximum degree at most $\Delta$.

This version of the theorem follows from our previous argument, since our graphs $G_i$
are subgraphs of the initial graph $G$. Hence, if we start with a graph of maximum
degree at most $\Delta$, then our FPAUS need only work on graphs with maximum degree at
most $\Delta$. In the next chapter, we will see how to create an FPAUS for graphs with
maximum degree 4.

This technique can be applied to a broad range of combinatorial counting problems.
For example, in Chapter 11 we consider its application to finding proper colorings of
a graph $G$. The only requirement is that we can construct a sequence of refinements
of the problem, starting with an instance that is easy to count (the number of independent
sets in a graph with no edges, in our example) and ending with the actual counting
problem, and such that the ratio between the counts in successive instances is at most
polynomial in the size of the problem.

10.4. The Markov Chain Monte Carlo Method 


The Monte Carlo method is based on sampling. It is often difficult to generate a random
sample with the required probability distribution. For example, we saw in the previous
section that we can count the number of independent sets in a graph if we can generate
an almost uniform sample from the set of independent sets. But how can we generate
an almost uniform sample?

The Markov chain Monte Carlo (MCMC) method provides a very general approach
to sampling from a desired probability distribution. The basic idea is to define an
ergodic Markov chain whose set of states is the sample space and whose stationary
distribution is the required sampling distribution. Let $X_0, X_1, \ldots, X_n$ be a run of the chain.
The Markov chain converges to the stationary distribution from any starting state $X_0$
and so, after a sufficiently large number of steps $r$, the distribution of the state $X_r$ will
be close to the stationary distribution, so it can be used as a sample. Similarly, repeating
this argument with $X_r$ as the starting point, we can use $X_{2r}$ as a sample, and so
on. We can therefore use the sequence of states $X_r, X_{2r}, X_{3r}, \ldots$ as almost independent
samples from the stationary distribution of the Markov chain. The efficiency of
this approach depends on (a) how large $r$ must be to ensure a suitably good sample
and (b) how much computation is required for each step of the Markov chain. In this
section, we focus on finding efficient Markov chains with the appropriate stationary
distribution and ignore the issue of how large $r$ needs to be. Coupling, which is one
method for determining the relationship between the value of $r$ and the quality of the
sample, is discussed in the next chapter.
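
The idea of keeping every $r$th state can be sketched as follows. The helper `thinned_samples` and the toy chain (a lazy random walk on a 5-cycle, whose stationary distribution is uniform) are illustrative assumptions, and the value $r = 50$ is an arbitrary choice, since how large $r$ must be is exactly the question deferred to the next chapter.

```python
import random

def thinned_samples(step, start, r, count, rng):
    """Return the states X_r, X_{2r}, ..., X_{count*r} of a chain run from `start`."""
    state = start
    samples = []
    for _ in range(count):
        for _ in range(r):
            state = step(state, rng)
        samples.append(state)
    return samples

def lazy_cycle_step(state, rng):
    """Lazy walk on a 5-cycle: stay with probability 1/2, else move to a random neighbor."""
    if rng.random() < 0.5:
        return state
    return (state + rng.choice([-1, 1])) % 5

rng = random.Random(1)
samples = thinned_samples(lazy_cycle_step, start=0, r=50, count=2000, rng=rng)
frequencies = [samples.count(s) / len(samples) for s in range(5)]
print(frequencies)  # each entry should be near 1/5
```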

In the simplest case, the goal is to construct a Markov chain with a stationary
distribution that is uniform over the state space $\Omega$. The first step is to design a set of moves
that ensures the state space is irreducible under the Markov chain. Let us call the set
of states reachable in one step from a state $x$ (but excluding $x$) the neighbors of $x$,
denoted by $N(x)$. We adopt the restriction that if $y \in N(x)$ then also $x \in N(y)$. Generally
$N(x)$ will be a small set, so that performing each move is simple computationally.

We again use the setting of independent sets in a graph $G = (V, E)$ as an example.
The state space is all of the independent sets of $G$. A natural neighborhood framework
is to say that states $x$ and $y$, which are independent sets, are neighbors if they differ in
just one vertex. That is, $x$ can be obtained from $y$ by adding or deleting just one
vertex. This neighbor relationship guarantees that the state space is irreducible, since all
independent sets can reach (respectively, can be reached from) the empty independent
set by a sequence of vertex deletions (respectively, vertex additions).

Once the neighborhoods are established, we need to establish transition probabilities.
One natural approach to try would be performing a random walk on the graph
of the state space. This might not lead to a uniform distribution, however. We saw in
Theorem 7.13 that, in the stationary distribution of a random walk, the probability of
a vertex is proportional to the degree of the vertex. Nothing in our previous discussion
requires all states to have the same number of neighbors, which is equivalent to
all vertices in the graph of the state space having the same degree.

The following lemma shows that, if we modify the random walk by giving each 
vertex an appropriate self-loop probability, then we can obtain a uniform stationary 
distribution. 

Lemma 10.7: For a finite state space $\Omega$ and neighborhood structure $\{N(x) \mid x \in \Omega\}$,
let $N = \max_{x \in \Omega} |N(x)|$. Let $M$ be any number such that $M \ge N$. Consider a Markov
chain where

\[ P_{x,y} = \begin{cases} 1/M & \text{if } x \ne y \text{ and } y \in N(x), \\ 0 & \text{if } x \ne y \text{ and } y \notin N(x), \\ 1 - |N(x)|/M & \text{if } x = y. \end{cases} \]

If this chain is irreducible and aperiodic, then the stationary distribution is the uniform
distribution.

Proof: We show that the chain is time reversible and then apply Theorem 7.10. For
any $x \ne y$, if $\pi_x = \pi_y$ then

\[ \pi_x P_{x,y} = \pi_y P_{y,x}, \]

since $P_{x,y} = P_{y,x}$ (both equal $1/M$ when $y \in N(x)$, and both equal 0 otherwise). It
follows that the uniform distribution $\pi_x = 1/|\Omega|$ is the stationary distribution. ■
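
Lemma 10.7 can be checked exactly on a small example. The sketch below builds the transition matrix of the lemma for a hypothetical four-state neighborhood structure (a path, so the end states have fewer neighbors than the inner ones) and verifies $\pi P = \pi$ for the uniform distribution using exact rational arithmetic. The function names are assumptions made for illustration.

```python
from fractions import Fraction

def uniform_chain_matrix(neighbors, M):
    """Transition matrix of Lemma 10.7 for a symmetric neighborhood structure.

    `neighbors[x]` is the set N(x); M must be at least max |N(x)|.
    """
    states = sorted(neighbors)
    P = {x: {y: Fraction(0) for y in states} for x in states}
    for x in states:
        for y in neighbors[x]:
            P[x][y] = Fraction(1, M)
        P[x][x] = 1 - Fraction(len(neighbors[x]), M)  # self-loop probability
    return states, P

def is_stationary(pi, states, P):
    """Check pi P = pi exactly."""
    return all(sum(pi[x] * P[x][y] for x in states) == pi[y] for y in states)

# A path 0 - 1 - 2 - 3: end states have one neighbor, inner states have two.
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
states, P = uniform_chain_matrix(neighbors, M=2)
uniform = {x: Fraction(1, len(states)) for x in states}
print(is_stationary(uniform, states, P))
```

By contrast, the degree-proportional distribution $(1, 2, 2, 1)/6$, which is stationary for the plain random walk on this path, is not stationary for the modified chain.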

Consider now the following simple Markov chain, whose states are independent sets
in a graph $G = (V, E)$.

1. $X_0$ is an arbitrary independent set in $G$.

2. To compute $X_{i+1}$:

(a) choose a vertex $v$ uniformly at random from $V$;

(b) if $v \in X_i$ then $X_{i+1} = X_i \setminus \{v\}$;

(c) if $v \notin X_i$ and if adding $v$ to $X_i$ still gives an independent set, then $X_{i+1} = X_i \cup \{v\}$;

(d) otherwise, $X_{i+1} = X_i$.

This chain has the property that the neighbors of a state $X_i$ are all independent sets
that differ from $X_i$ in just one vertex. Since every state can reach and is reachable from
the empty set, the chain is irreducible. Assuming that $G$ has at least one edge $(u, v)$,
then the state $\{v\}$ has a self-loop ($P_{v,v} > 0$), and the chain is aperiodic. Further, when
$y \ne x$, it follows that $P_{x,y} = 1/|V|$ or 0. Lemma 10.7 therefore applies, and the
stationary distribution is the uniform distribution.
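
A minimal simulation of this chain is sketched below. The graph (a path on three vertices), the step count, and the helper names are illustrative assumptions. The code reports how often each state is visited along one long run, which is a convenient sanity check of the uniform stationary distribution rather than a way of producing nearly independent samples.

```python
import random
from collections import Counter

def step(current, vertices, adj, rng):
    """One transition: pick v uniformly, then delete it, add it, or stay put."""
    v = rng.choice(vertices)
    if v in current:
        return current - {v}
    if all(u not in current for u in adj[v]):
        return current | {v}
    return current

def run_chain(vertices, edges, steps, seed=0):
    """Run the chain from the empty set and count visits to each state."""
    rng = random.Random(seed)
    adj = {v: set() for v in vertices}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    state = frozenset()
    visits = Counter()
    for _ in range(steps):
        state = frozenset(step(state, vertices, adj, rng))
        visits[state] += 1
    return visits

total = 200_000
visits = run_chain([0, 1, 2], [(0, 1), (1, 2)], steps=total)
for s, c in sorted(visits.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(s), c / total)  # each of the 5 independent sets should be near 1/5
```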

10.4.1. The Metropolis Algorithm

We have seen how to construct chains with a uniform stationary distribution. In some
cases, however, we may want to sample from a chain with a nonuniform stationary
distribution. The Metropolis algorithm refers to a general construction that transforms
any irreducible Markov chain on a state space $\Omega$ to a time-reversible Markov chain
with a required stationary distribution. The approach generalizes the idea we used
before to create chains with uniform stationary distributions: add self-loop probabilities
to states in order to obtain the desired stationary distribution.

Let us again assume that we have designed an irreducible state space for our Markov
chain; now we want to construct a Markov chain on this state space with a stationary
distribution $\pi_x = b(x)/B$, where for all $x \in \Omega$ we have $b(x) > 0$ and such
that $B = \sum_{x \in \Omega} b(x)$ is finite. As we see in the following lemma (which generalizes
Lemma 10.7), we only need the ratios between the required probabilities: the sum $B$
can be unknown.


Lemma 10.8: For a finite state space $\Omega$ and neighborhood structure $\{N(x) \mid x \in \Omega\}$,
let $N = \max_{x \in \Omega} |N(x)|$. Let $M$ be any number such that $M \ge N$. For all $x \in \Omega$, let
$\pi_x > 0$ be the desired probability of state $x$ in the stationary distribution. Consider a
Markov chain where

\[ P_{x,y} = \begin{cases} (1/M) \min(1, \pi_y/\pi_x) & \text{if } x \ne y \text{ and } y \in N(x), \\ 0 & \text{if } x \ne y \text{ and } y \notin N(x), \\ 1 - \sum_{y \ne x} P_{x,y} & \text{if } x = y. \end{cases} \]

Then, if this chain is irreducible and aperiodic, the stationary distribution is given by
the probabilities $\pi_x$.

Proof: As in the proof of Lemma 10.7, we show that the chain is time reversible and
apply Theorem 7.10. For any $x \ne y$ with $y \notin N(x)$, both $P_{x,y}$ and $P_{y,x}$ are 0 and
there is nothing to show. For $y \in N(x)$, if $\pi_x \le \pi_y$ then $P_{x,y} = 1/M$ and
$P_{y,x} = (1/M)(\pi_x/\pi_y)$. It follows that $\pi_x P_{x,y} = \pi_y P_{y,x}$. Similarly, if $\pi_x > \pi_y$ then
$P_{x,y} = (1/M)(\pi_y/\pi_x)$ and $P_{y,x} = 1/M$, and it follows that $\pi_x P_{x,y} = \pi_y P_{y,x}$. By
Theorem 7.10, the stationary distribution is given by the values $\pi_x$. ■
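
The construction of Lemma 10.8 can be verified exactly on a toy example. The sketch below uses a hypothetical four-state path with unnormalized weights $b(x)$, builds the Metropolis transition matrix from the ratios $b(y)/b(x)$ only, and checks detailed balance with exact rational arithmetic. The function and variable names are assumptions made for illustration.

```python
from fractions import Fraction

def metropolis_matrix(neighbors, weight, M):
    """Transition matrix of Lemma 10.8; `weight[x]` is proportional to pi_x."""
    states = sorted(neighbors)
    P = {x: {y: Fraction(0) for y in states} for x in states}
    for x in states:
        for y in neighbors[x]:
            ratio = Fraction(weight[y], weight[x])  # pi_y / pi_x; B cancels
            P[x][y] = Fraction(1, M) * min(Fraction(1), ratio)
        P[x][x] = 1 - sum(P[x][y] for y in neighbors[x])
    return states, P

neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
weight = {0: 1, 1: 2, 2: 4, 3: 8}  # unnormalized b(x); the sum B = 15 is never used
states, P = metropolis_matrix(neighbors, weight, M=2)

# Detailed balance: b(x) P[x][y] == b(y) P[y][x] for every pair of states.
balanced = all(weight[x] * P[x][y] == weight[y] * P[y][x]
               for x in states for y in states)
print(balanced)
```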


As an example of how to apply Lemma 10.8, let us consider how to modify our previous
Markov chains on independent sets. Let us suppose that now we want to create a
Markov chain where, in the stationary distribution, each independent set $I$ has probability
proportional to $\lambda^{|I|}$ for some constant parameter $\lambda > 0$. That is, $\pi_x = \lambda^{|I_x|}/B$,
where $I_x$ is the independent set corresponding to state $x$ and where $B = \sum_x \lambda^{|I_x|}$.
When $\lambda = 1$, this is the uniform distribution; when $\lambda > 1$, larger independent sets have
a larger probability than smaller independent sets; and when $\lambda < 1$, larger independent
sets have a smaller probability than smaller independent sets.

Consider now the following variation on the previous Markov chain for independent
sets in a graph $G = (V, E)$.

1. $X_0$ is an arbitrary independent set in $G$.

2. To compute $X_{i+1}$:

(a) choose a vertex $v$ uniformly at random from $V$;

(b) if $v \in X_i$, set $X_{i+1} = X_i \setminus \{v\}$ with probability $\min(1, 1/\lambda)$;

(c) if $v \notin X_i$ and if adding $v$ to $X_i$ still gives an independent set, then put
$X_{i+1} = X_i \cup \{v\}$ with probability $\min(1, \lambda)$;

(d) otherwise, set $X_{i+1} = X_i$.


We now follow a two-step approach. We first propose a move by choosing a vertex
$v$ to add or delete, where each vertex is chosen with probability $1/M$; here $M = |V|$.
This proposal is then accepted with probability $\min(1, \pi_y/\pi_x)$, where $x$ is the current
state and $y$ is the proposed state to which the chain will move. Here, $\pi_y/\pi_x$ is $\lambda$
if the chain attempts to add a vertex and is $1/\lambda$ if the chain attempts to delete a vertex.
This two-step approach is the hallmark of the Metropolis algorithm: each neighbor is
selected with probability $1/M$, and then it is accepted with probability $\min(1, \pi_y/\pi_x)$.
Using this two-step approach, we naturally obtain that the transition probability $P_{x,y}$ is

\[ P_{x,y} = \frac{1}{M} \min\left(1, \frac{\pi_y}{\pi_x}\right), \]

so Lemma 10.8 applies.
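
The two-step rule (propose, then accept or reject) can be sketched as follows. The graph, the value $\lambda = 2$, the step count, and the helper names are illustrative assumptions; as in the uniform case, the code reports visit frequencies along one long run as a sanity check of the stationary distribution.

```python
import random
from collections import Counter

def metropolis_step(current, vertices, adj, lam, rng):
    """Propose adding or deleting a uniform random vertex, then accept or reject."""
    v = rng.choice(vertices)
    if v in current:
        if rng.random() < min(1.0, 1.0 / lam):  # deletion: pi_y / pi_x = 1 / lam
            return current - {v}
    elif all(u not in current for u in adj[v]):
        if rng.random() < min(1.0, lam):        # addition: pi_y / pi_x = lam
            return current | {v}
    return current

def visit_frequencies(vertices, edges, lam, steps, seed=0):
    """Fraction of steps spent in each independent set along one run."""
    rng = random.Random(seed)
    adj = {v: set() for v in vertices}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    state = frozenset()
    visits = Counter()
    for _ in range(steps):
        state = frozenset(metropolis_step(state, vertices, adj, lam, rng))
        visits[state] += 1
    return {s: c / steps for s, c in visits.items()}

# Path 0 - 1 - 2 with lam = 2: the weights 1, 2, 2, 2, 4 sum to B = 11.
freq = visit_frequencies([0, 1, 2], [(0, 1), (1, 2)], lam=2.0, steps=300_000)
print(freq[frozenset()], freq[frozenset({0, 2})])  # near 1/11 and 4/11
```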

It is important that, in designing this Markov chain, we never needed to know
$B = \sum_x \lambda^{|I_x|}$. A graph with $n$ vertices can have exponentially many independent sets, and
calculating this sum directly would be too expensive computationally for many graphs.
Our Markov chain gives the correct stationary distribution by using the ratios $\pi_y/\pi_x$,
which are much easier to deal with.

10.5. Exercises


Exercise 10.1: Formally prove Theorem 10.1. 


Exercise 10.2: Another method for approximating $\pi$ using Monte Carlo techniques is
based on Buffon's needle experiment. Research and explain Buffon's needle experiment,
and further explain how it can be used to obtain an approximation for $\pi$.

Exercise 10.3: Show that the following alternative definition is equivalent to the
definition of an FPRAS given in the chapter: A fully polynomial randomized approximation
scheme (FPRAS) for a problem is a randomized algorithm for which, given an input $x$
and any parameter $\varepsilon$ with $0 < \varepsilon < 1$, the algorithm outputs an $(\varepsilon, 1/4)$-approximation
in time that is polynomial in $1/\varepsilon$ and the size of the input $x$. (Hint: To boost the
probability of success from 3/4 to $1 - \delta$, consider the median of several independent runs
of the algorithm. Why is the median a better choice than the mean?)

Exercise 10.4: Suppose we have a class of instances of the DNF satisfiability problem,
each with $\alpha(n)$ satisfying truth assignments for some polynomial $\alpha$. Suppose we
apply the naive approach of sampling assignments and checking whether they satisfy
the formula. Show that, after sampling $2^{n/2}$ assignments, the probability of finding
even a single satisfying assignment for a given instance is exponentially small in $n$.

Exercise 10.5: (a) Let $S_1, S_2, \ldots, S_m$ be subsets of a finite universe $U$. We know $|S_i|$
for $1 \le i \le m$. We wish to obtain an $(\varepsilon, \delta)$-approximation to the size of the set

\[ S = \bigcup_{i=1}^{m} S_i. \]

We have available a procedure that can, in one step, choose an element uniformly at
random from a set $S_i$. Also, given an element $x \in U$, we can determine the number of
sets $S_i$ for which $x \in S_i$. We call this number $c(x)$.

Define $p_i$ to be

\[ p_i = \frac{|S_i|}{\sum_{j=1}^{m} |S_j|}. \]

The $j$th trial consists of the following steps. We choose a set $S_j$, where the probability
of each set $S_i$ being chosen is $p_i$, and then we choose an element $x_j$ uniformly at random
from $S_j$. In each trial the random choices are independent of all other trials. After
$t$ trials, we estimate $|S|$ by

\[ \left( \frac{1}{t} \sum_{j=1}^{t} \frac{1}{c(x_j)} \right) \left( \sum_{i=1}^{m} |S_i| \right). \]

Determine, as a function of $m$, $\varepsilon$, and $\delta$, the number of trials needed to obtain an
$(\varepsilon, \delta)$-approximation to $|S|$.

(b) Explain how to use your results from part (a) to obtain an alternative approximation
algorithm for counting the number of solutions to a DNF formula.


Exercise 10.6: The problem of counting the number of solutions to a knapsack instance
can be defined as follows: Given items with sizes $a_1, a_2, \ldots, a_n > 0$ and an integer $b > 0$,
find the number of vectors $(x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$ such that $\sum_{i=1}^{n} a_i x_i \le b$. The
number $b$ can be thought of as the size of a knapsack, and the $x_i$ denote whether or
not each item is put into the knapsack. Counting solutions corresponds to counting the
number of different sets of items that can be placed in the knapsack without exceeding
its capacity.


(a) A naive way of counting the number of solutions to this problem is to repeatedly
choose $(x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$ uniformly at random, and return $2^n$ times the
fraction of samples that yield valid solutions. Argue why this is not a good strategy
in general; in particular, argue that it will work poorly when each $a_i$ is 1 and
$b = \sqrt{n}$.

(b) Consider a Markov chain $X_0, X_1, \ldots$ on vectors $(x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$. Suppose
$X_j$ is $(x_1, x_2, \ldots, x_n)$. At each step, the Markov chain chooses $i \in [1, n]$ uniformly
at random. If $x_i = 1$, then $X_{j+1}$ is obtained from $X_j$ by setting $x_i$ to 0. If $x_i = 0$,
then $X_{j+1}$ is obtained from $X_j$ by setting $x_i$ to 1 if doing so maintains the restriction
$\sum_{i=1}^{n} a_i x_i \le b$. Otherwise, $X_{j+1} = X_j$.

Argue that this Markov chain has a uniform stationary distribution whenever
$\sum_{i=1}^{n} a_i > b$. Be sure to argue that the chain is irreducible and aperiodic.

(c) Argue that, if we have an FPAUS for the knapsack problem, then we can derive an
FPRAS for the problem. To set the problem up properly, assume without loss of
generality that $a_1 \le a_2 \le \cdots \le a_n$. Let $b_0 = 0$ and $b_i = \sum_{j=1}^{i} a_j$. Let $\Omega(b_i)$ be
the set of vectors $(x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$ that satisfy $\sum_{j=1}^{n} a_j x_j \le b_i$. Let $k$ be
the smallest integer such that $b_k \ge b$. Consider the equation

\[ |\Omega(b)| = \frac{|\Omega(b)|}{|\Omega(b_{k-1})|} \times \frac{|\Omega(b_{k-1})|}{|\Omega(b_{k-2})|} \times \cdots \times \frac{|\Omega(b_1)|}{|\Omega(b_0)|} \times |\Omega(b_0)|. \]

You will need to argue that $|\Omega(b_{i-1})|/|\Omega(b_i)|$ is not too small. Specifically, argue
that $|\Omega(b_i)| \le (n + 1)|\Omega(b_{i-1})|$.


Exercise 10.7: An alternative definition for an $\varepsilon$-uniform sample of $\Omega$ is as follows: A
sampling algorithm generates an $\varepsilon$-uniform sample $w$ if, for all $x \in \Omega$,

\[ \frac{|\Pr(w = x) - 1/|\Omega||}{1/|\Omega|} \le \varepsilon. \]

Show that an $\varepsilon$-uniform sample under this definition yields an $\varepsilon$-uniform sample as
given in Definition 10.3.

Exercise 10.8: Let $S = \sum_{i=1}^{\infty} i^{-2} = \pi^2/6$. Design a Markov chain based on the
Metropolis algorithm on the positive integers such that, in the stationary distribution,
$\pi_i = 1/(S i^2)$. The neighbors of any integer $i > 1$ for your chain should be only $i - 1$ and $i + 1$,
and the only neighbor of 1 should be the integer 2.

Exercise 10.9: Recall the Bubblesort algorithm of Exercise 2.22. Suppose we have $n$
cards labeled 1 through $n$. The order of the cards $X$ can be the state of a Markov chain.
Let $f(X)$ be the number of Bubblesort moves necessary to put the cards in increasing
sorted order. Design a Markov chain based on the Metropolis algorithm such that, in
the stationary distribution, the probability of an order $X$ is proportional to $\lambda^{f(X)}$ for a
given constant $\lambda > 0$. Pairs of states of the chain are connected if they correspond to
pairs of orderings that can be obtained by interchanging at most two cards.

Exercise 10.10: A $\Delta$-coloring $C$ of an undirected graph $G = (V, E)$ is an assignment
labeling each vertex with a number, representing a color, from the set $\{1, 2, \ldots, \Delta\}$. An
edge $(u, v)$ is improper if both $u$ and $v$ are assigned the same color. Let $I(C)$ be the
number of improper edges of a coloring $C$. Design a Markov chain based on the
Metropolis algorithm such that, in the stationary distribution, the probability of a coloring
$C$ is proportional to $\lambda^{I(C)}$ for a given constant $\lambda > 0$. Pairs of states of the chain are
connected if they correspond to pairs of colorings that differ in just one vertex.


Exercise 10.11: In Section 10.4.1 we constructed a Markov chain on the independent
sets of a graph where, in the stationary distribution, $\pi_x = \lambda^{|I_x|}/B$. Here $I_x$ is the
independent set corresponding to state $x$ and $B = \sum_x \lambda^{|I_x|}$. Using a similar approach,
construct a Markov chain on the independent sets of a graph excluding the empty set,
where $\pi_x = |I_x|/B$ for a constant $B$. Because the chain excludes the empty set, you
should first design a neighborhood structure that ensures the state space is connected.


Exercise 10.12: The following generalization of the Metropolis algorithm is due to
Hastings. Suppose that we have a Markov chain on a state space $\Omega$ given by the
transition matrix $Q$ and that we want to construct a Markov chain on this state space with
a stationary distribution $\pi_x = b(x)/B$, where for all $x \in \Omega$, $b(x) > 0$ and
$B = \sum_{x \in \Omega} b(x)$ is finite. Define a new Markov chain as follows. When $X_n = x$, generate
a random variable $Y$ with $\Pr(Y = y) = Q_{x,y}$. Notice that $Y$ can be generated by
simulating one step of the original Markov chain. Set $X_{n+1}$ to $Y$ with probability

\[ \min\left( \frac{\pi_y Q_{y,x}}{\pi_x Q_{x,y}}, 1 \right), \]

and otherwise set $X_{n+1}$ to $X_n$. Argue that, if this chain is aperiodic and irreducible,
then it is also time reversible and has a stationary distribution given by the $\pi_x$.


Exercise 10.13: Suppose we have a program that takes as input a number $x$ on the real
interval $[0, 1]$ and outputs $f(x)$ for some bounded function $f$ taking on values in the
range $[1, b]$. We want to estimate

\[ \int_{0}^{1} f(x) \, dx. \]

Assume that we have a random number generator that can generate independent uniform
random variables $X_1, X_2, \ldots$. Show that

\[ \frac{1}{m} \sum_{i=1}^{m} f(X_i) \]

gives an $(\varepsilon, \delta)$-approximation for the integral for a suitable value of $m$.

10.6. An Exploratory Assignment on Minimum Spanning Trees 

Consider a complete, undirected graph with $\binom{n}{2}$ edges. Each edge has a weight, which
is a real number chosen uniformly at random on [0,1].

Your goal is to estimate how the expected weight of the minimum spanning tree 
grows as a function of n for such graphs. This will require implementing a minimum 
spanning tree algorithm as well as procedures that generate the appropriate random 
graphs. (You should check to see what sorts of random number generators are available
on your system and determine how to seed them - say, with a value from the
machine's clock.)

Depending on the algorithm you use and your implementation, you may find that 
your program uses too much memory when n is large. To reduce memory when n is 
large, we suggest the following approach. In this setting, the minimum spanning tree 
is extremely unlikely to use any edge of weight greater than k(n) for some function 
k(n). We can first estimate k(n) by using repeated runs for small values of n and then 
throw away edges of weight larger than k(n) when n is large. If you use this approach,
be sure to explain why throwing away edges in this manner will not lead to a situation 
where the program finds a spanning tree that is not actually minimal. 

Run your program for n = 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, and 
larger values, if your program runs fast enough. Run your program at least five times 
for each value of n and take the average. (Make sure you re-seed the random number 
generator appropriately!) You should present a table listing the average tree size for the 
values of n that your program runs successfully. What seems to be happening to the 
average size of the minimum spanning tree as n grows? 

In addition, you should write one or two pages discussing your experiments in more 
depth. The discussion should reflect what you have learned from this assignment and 
might address the following topics. 

• What minimum spanning tree algorithm did you use, and why? 

• What is the running time of your algorithm? 

• If you chose to throw away edges, how did you determine k(n), and how effective 
was this approach? 

• Can you give a rough explanation for your results? (The limiting behavior as n 
grows large can be proven rigorously, but it is very difficult; you need not attempt 
to prove any exact result.) 

• Did you have any interesting experiences with the random number generator? Do
you trust it?

CHAPTER ELEVEN

Coupling of Markov Chains


In our study of discrete time Markov chains in Chapter 7, we found that ergodic Markov
chains converge to a stationary distribution. However, we did not determine how quickly
they converge, which is important in a number of algorithmic applications, such as
sampling using the Markov chain Monte Carlo technique. In this chapter, we introduce
the concept of coupling, a powerful method for bounding the rate of convergence of
Markov chains. We demonstrate the coupling method in several applications, including
card-shuffling problems, random walks, and Markov chain Monte Carlo sampling
of independent sets and vertex coloring.


11.1. Variation Distance and Mixing Time 


Consider the following method for shuffling $n$ cards. At each step, a card is chosen
independently and uniformly at random and put on the top of the deck. We can think
of the shuffling process as a Markov chain, where the state is the current order of the 
cards. You can check that the Markov chain is finite, irreducible, and aperiodic, and 
hence it has a stationary distribution. 

Let $x$ be a state of the chain, and let $N(x)$ be the set of states that can reach $x$ in one
step. Here $|N(x)| = n$, since the top card in $x$ could have been in $n$ different places
in the previous step. If $\pi_y$ is the probability associated with state $y$ in the stationary
distribution, then for any state $x$ we have

\[ \pi_x = \frac{1}{n} \sum_{y \in N(x)} \pi_y. \]

The uniform distribution satisfies these equations, and hence the unique stationary
distribution is uniform over all possible permutations.
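
The stationary equations can be checked exactly for a small deck. The sketch below builds the transition probabilities of the move-to-top shuffle on three cards and verifies that the uniform distribution satisfies them; the function names and the tuple representation of a deck (top card first) are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import permutations

def move_to_top(order, i):
    """Return the deck after moving the card at position i to the top."""
    return (order[i],) + order[:i] + order[i + 1:]

def shuffle_matrix(n):
    """Exact transition probabilities of the move-to-top shuffle on n cards."""
    states = list(permutations(range(n)))
    P = {x: {y: Fraction(0) for y in states} for x in states}
    for x in states:
        for i in range(n):
            P[x][move_to_top(x, i)] += Fraction(1, n)
    return states, P

states, P = shuffle_matrix(3)
uniform = Fraction(1, len(states))
# Check pi_x = sum over y of pi_y * P[y][x] for the uniform distribution.
stationary = all(sum(uniform * P[y][x] for y in states) == uniform for x in states)
print(stationary)
```

Note that, as in the text, the set of states that can reach $x$ in one step includes $x$ itself (choosing the top card leaves the deck unchanged), which is why each state has exactly $n$ predecessors.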

We know that the stationary distribution is the limiting distribution of the Markov 
chain as the number of steps grows to infinity. If we could run the chain “forever”, 
then in the limit we would obtain a state that was uniformly distributed. In practice,
we run the chain for a finite number of steps. If we want to use this Markov chain to
shuffle the deck, how many steps are necessary before we obtain a shuffle that is close
to uniformly distributed?

Figure 11.1: Example of variation distance. The areas shaded by upward diagonal lines correspond
to values $x$ where $D_1(x) < D_2(x)$; the areas shaded by downward diagonal lines correspond to
values $x$ where $D_1(x) > D_2(x)$. The total area shaded by upward diagonal lines must equal the total
area shaded by downward diagonal lines, and the variation distance equals one of these two areas.

To quantify what we mean by “close to uniform”, we must introduce a distance
measure.

Definition 11.1: The variation distance between two distributions $D_1$ and $D_2$ on a
countable state space $S$ is given by

\[ \|D_1 - D_2\| = \frac{1}{2} \sum_{x \in S} |D_1(x) - D_2(x)|. \]

A pictorial example of the variation distance is given in Figure 11.1. 

The factor 1/2 in the definition of variation distance guarantees that the variation
distance is between 0 and 1. It also allows the following useful alternative characterization.

Lemma 11.1: For any $A \subseteq S$, let $D_i(A) = \sum_{x \in A} D_i(x)$ for $i = 1, 2$. Then

\[ \|D_1 - D_2\| = \max_{A \subseteq S} |D_1(A) - D_2(A)|. \]

A careful examination of Figure 11.1 helps make the proof of this lemma transparent.

Proof: Let $S^+ \subseteq S$ be the set of states such that $D_1(x) \ge D_2(x)$, and let $S^- \subseteq S$ be
the set of states such that $D_2(x) > D_1(x)$.

Clearly,

\[ \max_{A \subseteq S} D_1(A) - D_2(A) = D_1(S^+) - D_2(S^+) \]

and

\[ \max_{A \subseteq S} D_2(A) - D_1(A) = D_2(S^-) - D_1(S^-). \]

But since $D_1(S) = D_2(S) = 1$, we have

\[ D_1(S^+) + D_1(S^-) = D_2(S^+) + D_2(S^-) = 1, \]

which implies that

\[ D_1(S^+) - D_2(S^+) = D_2(S^-) - D_1(S^-). \]

Hence

\[ \max_{A \subseteq S} |D_1(A) - D_2(A)| = |D_1(S^+) - D_2(S^+)| = |D_1(S^-) - D_2(S^-)|. \]

Finally, since

\[ |D_1(S^+) - D_2(S^+)| + |D_1(S^-) - D_2(S^-)| = \sum_{x \in S} |D_1(x) - D_2(x)| = 2\|D_1 - D_2\|, \]

we have

\[ \max_{A \subseteq S} |D_1(A) - D_2(A)| = \|D_1 - D_2\|, \]

completing the proof. ■
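
Lemma 11.1 is easy to check numerically on a small state space. The sketch below computes the variation distance from Definition 11.1 and compares it with a brute-force maximum over all events $A$; the function names and the two example distributions are assumptions made for illustration.

```python
from itertools import combinations

def variation_distance(d1, d2):
    """Half the L1 distance between two distributions on the same finite space."""
    support = set(d1) | set(d2)
    return 0.5 * sum(abs(d1.get(x, 0.0) - d2.get(x, 0.0)) for x in support)

def max_event_gap(d1, d2):
    """Maximum over events A of |D1(A) - D2(A)|, by brute force over all subsets."""
    support = sorted(set(d1) | set(d2))
    best = 0.0
    for size in range(len(support) + 1):
        for event in combinations(support, size):
            gap = abs(sum(d1.get(x, 0.0) - d2.get(x, 0.0) for x in event))
            best = max(best, gap)
    return best

d1 = {"a": 0.5, "b": 0.3, "c": 0.2}
d2 = {"a": 0.2, "b": 0.3, "c": 0.5}
print(variation_distance(d1, d2), max_event_gap(d1, d2))  # both are about 0.3
```

Here the maximizing event is $S^+ = \{a\}$ (or, equivalently, $\{a, b\}$, since $b$ contributes zero), matching the proof.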

As an application of Lemma 11.1, suppose that we run our shuffling Markov chain until
the variation distance between the distribution of the chain and the uniform distribution
is less than $\varepsilon$. This is a strong notion of close to uniform, because every permutation of
the cards must have probability at most $1/n! + \varepsilon$. In fact the bound on the variation
distance gives an even stronger statement: For any subset $A \subseteq S$, the probability that the
final permutation is from the set $A$ is at most $\pi(A) + \varepsilon$. For example, suppose someone
is trying to make the top card in the deck an ace. If the variation distance from the
distribution to the uniform distribution is less than $\varepsilon$, we can safely say that the probability
that an ace is the first card in the deck is at most $\varepsilon$ greater than if we had a perfect
shuffle.

As another example, suppose we take a 52-card deck and shuffle all the cards - but
leave the ace of spades on top. In this case, the variation distance between the resulting
distribution $D_1$ and the uniform distribution $D_2$ could be bounded by considering
the set $B$ of states where the ace of spades is on the top of the deck:

\[ \|D_1 - D_2\| = \max_{A \subseteq S} |D_1(A) - D_2(A)| \ge |D_1(B) - D_2(B)| = 1 - \frac{1}{52} = \frac{51}{52}. \]
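
As a worked check, the bound above is in fact tight. The calculation below assumes that $D_1$ is exactly uniform over the $51!$ orderings that have the ace of spades on top (that is, the other 51 cards are perfectly shuffled); this assumption is ours, made to allow an exact computation from Definition 11.1.

```latex
\begin{align*}
\|D_1 - D_2\|
  &= \frac{1}{2} \sum_{x \in S} |D_1(x) - D_2(x)| \\
  &= \frac{1}{2} \left[ 51! \left( \frac{1}{51!} - \frac{1}{52!} \right)
       + (52! - 51!) \cdot \frac{1}{52!} \right] \\
  &= \frac{1}{2} \left[ \left( 1 - \frac{1}{52} \right) + \left( 1 - \frac{1}{52} \right) \right]
   = \frac{51}{52}.
\end{align*}
```

The first term sums over the $51!$ orderings with the ace on top, and the second over the remaining $52! - 51!$ orderings, to which $D_1$ assigns probability 0.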

The definition of variation distance coincides with the definition of an $\varepsilon$-uniform
sample (given in Definition 10.3). A sampling algorithm returns an $\varepsilon$-uniform sample
on $\Omega$ if and only if the variation distance between its output distribution $D$ and the
uniform distribution $U$ satisfies

\[ \|D - U\| \le \varepsilon. \]

Bounding the variation distance between the uniform distribution and the distribution 
of the state of a Markov chain after some number of steps can therefore be a useful 
way of proving the existence of efficient e-uniform samplers, which (as we showed in 
Chapter 10) can in turn lead to efficient approximate counting algorithms. 

We now consider how to bound this variation distance after $t$ steps. In what follows,
we assume that the Markov chains under consideration are ergodic discrete space and
discrete time chains with well-defined stationary distributions. The following definitions
will be useful.

Definition 11.2: Let $\bar{\pi}$ be the stationary distribution of a Markov chain with state
space $S$. Let $p_x^t$ represent the distribution of the state of the chain starting at state $x$
after $t$ steps. We define

\[ \Delta_x(t) = \|p_x^t - \bar{\pi}\|; \qquad \Delta(t) = \max_{x \in S} \Delta_x(t). \]

That is, $\Delta_x(t)$ is the variation distance between the stationary distribution and $p_x^t$, and
$\Delta(t)$ is the maximum of these values over all states $x$.

We also define

\[ \tau_x(\varepsilon) = \min\{t : \Delta_x(t) \le \varepsilon\}; \qquad \tau(\varepsilon) = \max_{x \in S} \tau_x(\varepsilon). \]

That is, $\tau_x(\varepsilon)$ is the first step $t$ at which the variation distance between $p_x^t$ and the
stationary distribution is at most $\varepsilon$, and $\tau(\varepsilon)$ is the maximum of these values over all
states $x$.


When $\tau(\varepsilon)$ is considered as a function of $\varepsilon$, it is generally called the mixing time of
the Markov chain. A chain is called rapidly mixing if $\tau(\varepsilon)$ is polynomial in $\log(1/\varepsilon)$
and the size of the problem. The size of the problem depends on the context; in the
shuffling example, the size would be the number of cards.
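For a chain small enough to write down its transition matrix, the mixing time can be computed directly from the definition. The sketch below is only an illustration under our own assumptions: the matrix `P` is a list of rows, the stationary distribution `pi` is supplied by the caller, and the stopping test is $\Delta_x(t) \le \varepsilon$. The two-state chain at the end is a made-up example, not one from the text.

```python
from fractions import Fraction


def variation_distance(p, q):
    """Half the L1 distance between two distributions given as lists."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2


def step(dist, P):
    """One step of the chain: row vector dist times transition matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def tau_x(P, pi, x, eps, max_steps=10_000):
    """First t at which the chain started at x is within eps of pi."""
    dist = [1 if j == x else 0 for j in range(len(P))]
    for t in range(max_steps + 1):
        if variation_distance(dist, pi) <= eps:
            return t
        dist = step(dist, P)
    raise ValueError("eps not reached within max_steps")


def mixing_time(P, pi, eps):
    """Maximum of tau_x over all starting states x."""
    return max(tau_x(P, pi, x, eps) for x in range(len(P)))


# Hypothetical two-state chain that switches state with probability 1/4.
quarter = Fraction(1, 4)
P = [[1 - quarter, quarter], [quarter, 1 - quarter]]
pi = [Fraction(1, 2), Fraction(1, 2)]
```

For this chain the distance from the stationary distribution halves at every step, starting from 1/2, so it first drops to 1/8 at step 2.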


11.2. Coupling 


Coupling of Markov chains is a general technique for bounding the mixing time of a 
Markov chain. 


Definition 11.3: A coupling of a Markov chain $M_t$ with state space $S$ is a Markov chain
$Z_t = (X_t, Y_t)$ on the state space $S \times S$ such that:

\[ \Pr(X_{t+1} = x' \mid Z_t = (x, y)) = \Pr(M_{t+1} = x' \mid M_t = x); \]
\[ \Pr(Y_{t+1} = y' \mid Z_t = (x, y)) = \Pr(M_{t+1} = y' \mid M_t = y). \]

That is, a coupling consists of two copies of the Markov chain $M$ running
simultaneously. These two copies are not literal copies; the two chains are not necessarily
in the same state, nor do they necessarily make the same move. Instead, we mean that each
copy behaves exactly like the original Markov chain in terms of its transition
probabilities. One obvious way to obtain a coupling is simply to take two independent runs of
the Markov chain. As we shall see, such a coupling is generally not very useful for our
purposes.

Instead, we are interested in couplings that (a) bring the two copies of the chain to 
the same state and then (b) keep them in the same state by having the two chains make 
identical moves once they are in the same state. When the two copies of the chain reach 
the same state, they are said to have coupled. The following lemma motivates why we 
seek couplings that couple. 
Lemma 11.2 [Coupling Lemma]: Let $Z_t = (X_t, Y_t)$ be a coupling for a Markov chain
$M$ on a state space $S$. Suppose that there exists a $T$ such that, for every $x, y \in S$,

\[ \Pr(X_T \ne Y_T \mid X_0 = x, Y_0 = y) \le \varepsilon. \]

Then

\[ \tau(\varepsilon) \le T. \]

That is, for any initial state, the variation distance between the distribution of the state
of the chain after $T$ steps and the stationary distribution is at most $\varepsilon$.

Proof: Consider the coupling when $Y_0$ is chosen according to the stationary
distribution and $X_0$ takes on any arbitrary value. For the given $T$ and $\varepsilon$ and for any
$A \subseteq S$,

\begin{align*}
\Pr(X_T \in A) &\ge \Pr((X_T = Y_T) \cap (Y_T \in A)) \\
  &= 1 - \Pr((X_T \ne Y_T) \cup (Y_T \notin A)) \\
  &\ge (1 - \Pr(Y_T \notin A)) - \Pr(X_T \ne Y_T) \\
  &\ge \Pr(Y_T \in A) - \varepsilon \\
  &= \pi(A) - \varepsilon.
\end{align*}

Here the second line rewrites the event through its complement, and the third line
follows from the union bound. For the fourth line, we used the fact that
$\Pr(X_T \ne Y_T) \le \varepsilon$ for any initial states $X_0$ and $Y_0$; in particular, this holds
when $Y_0$ is chosen according to the stationary distribution. For the last line, we used that
$\Pr(Y_T \in A) = \pi(A)$, since $Y_T$ is also distributed according to the stationary distribution.

The same argument for the set $S - A$ shows that $\Pr(X_T \notin A) \ge \pi(S - A) - \varepsilon$, or
$\Pr(X_T \in A) \le \pi(A) + \varepsilon$.

It follows that

\[ \max_{x, A} |p_x^T(A) - \pi(A)| \le \varepsilon, \]

so by Lemma 11.1 the variation distance from the stationary distribution after the chain
runs for $T$ steps is bounded above by $\varepsilon$. ■

11.2.1. Example: Shuffling Cards 

To apply the coupling lemma effectively to the card-shuffling Markov chain, we must
choose an appropriate coupling. Given two copies $X_t$ and $Y_t$ of the chain in different
states, one possibility for the coupling is to choose a position $j$ uniformly at random
from 1 to $n$ and simultaneously move to the top the $j$th card from the top in both chains.
This is a valid coupling, because each chain individually acts as the original shuffling
Markov chain. Although this coupling is natural, it does not appear immediately
useful. Since the chains start in different states, the $j$th cards from the top in the two
chains will usually be different. Moving these two different cards to the top does not seem
to bring the two copies of the chain toward the same state.

A more useful coupling is to choose a position $j$ uniformly at random from 1 to $n$
and then obtain $X_{t+1}$ from $X_t$ by moving the $j$th card to the top. Denote the value of
this card by $C$. To obtain $Y_{t+1}$ from $Y_t$, move the card with value $C$ to the top. The
coupling is again valid, because in both chains the probability a specific card is moved
to the top at each step is $1/n$. With this coupling, it is easy to see by induction that,
once a card $C$ is moved to the top, it is always in the same position in both copies of
the chain. Hence, the two copies are sure to become coupled once every card has been
moved to the top at least once.

Now our coupling problem for the shuffling Markov chain looks like a coupon
collector's problem; to bound the number of steps until the chains couple, we simply bound
how many times cards must be chosen uniformly at random before every card is
chosen at least once. We know that when the Markov chain runs for $n \ln n + cn$ steps, the
probability that a specific card has not been moved to the top at least once is at most

\[ \left(1 - \frac{1}{n}\right)^{n \ln n + cn} \le e^{-(\ln n + c)} = \frac{e^{-c}}{n}, \]

and thus (by the union bound) the probability that any card has not been moved to the
top at least once is at most $e^{-c}$. Hence, after only $n \ln n + n \ln(1/\varepsilon) = n \ln(n/\varepsilon)$ steps,
the probability that the chains have not coupled is at most $\varepsilon$. The coupling lemma
allows us to conclude that the variation distance between the uniform distribution and
the distribution of the state of the chain after $n \ln(n/\varepsilon)$ steps is bounded above by $\varepsilon$.
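The second coupling can be simulated directly. The following is a minimal sketch, assuming decks are Python lists with index 0 as the top card; the function names and the choice of a reversed deck as the second starting state are ours. It runs the coupled chains until every card has been moved to the top at least once and reports whether the two decks then agree.

```python
import random


def coupled_shuffle_step(x, y, rng):
    """Pick a position j in x, then move that card to the top of both decks."""
    j = rng.randrange(len(x))
    card = x[j]
    x_next = [card] + x[:j] + x[j + 1:]
    k = y.index(card)  # the same card value, wherever it sits in y
    y_next = [card] + y[:k] + y[k + 1:]
    return x_next, y_next, card


def steps_until_all_cards_moved(n, rng):
    """Run the coupling from two different decks until every card was chosen."""
    x = list(range(n))
    y = x[::-1]  # an arbitrary different starting order
    moved = set()
    t = 0
    while len(moved) < n:
        x, y, card = coupled_shuffle_step(x, y, rng)
        moved.add(card)
        t += 1
    return t, x == y
```

Averaging the returned step count over many seeds gives an empirical view of the coupon collector behavior; the number of steps can never be smaller than $n$.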

11.2.2. Example: Random Walks on the Hypercube

Recall from Section 4.5.1 that an $n$-dimensional hypercube, or $n$-cube, consists of
$N = 2^n$ nodes numbered from 0 to $N - 1$. Let $\bar{x} = (x_1, \ldots, x_n)$ be the binary
representation of $x$. Nodes $x$ and $y$ are connected by an edge if and only if $\bar{x}$ and $\bar{y}$ differ
in exactly one bit.

We consider the following Markov chain defined on the cube. At each step, choose
a coordinate $i$ uniformly at random from $[1, n]$. The new state is obtained from the
current state $x$ by keeping all coordinates of $\bar{x}$ the same, except possibly for $x_i$. The
coordinate $x_i$ is set to 0 with probability 1/2 and to 1 with probability 1/2. This Markov
chain is exactly the random walk on the hypercube, except that with probability 1/2
the chain stays at the same vertex instead of moving to a new one, which removes the
potential problem of periodicity. It follows easily that the stationary distribution of the
chain is uniform over the vertices of the hypercube.

We bound the mixing time $\tau(\varepsilon)$ of this Markov chain by using the obvious
coupling between two copies $X_t$ and $Y_t$ of the Markov chain: at each step, we have both
chains make the same move. With this coupling, the two copies of the chain will surely
agree on the $i$th coordinate, once the $i$th coordinate has been chosen for a move of the
Markov chain. Hence the chains will have coupled after all $n$ coordinates have each
been chosen at least once.

The mixing time can therefore be bounded by bounding the number of steps until
each coordinate has been chosen at least once by the Markov chain. This again reduces
to the coupon collector's problem, just as in the case of the shuffling chain. By the
same argument, the probability is less than $\varepsilon$ that after $n \ln(n\varepsilon^{-1})$ steps the chains have
not coupled, and hence by the coupling lemma the mixing time satisfies

\[ \tau(\varepsilon) \le n \ln(n\varepsilon^{-1}). \]
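The coupon collector step can be checked numerically. As a hedged illustration, the sketch below computes by inclusion-exclusion the exact probability that some coordinate has never been chosen after $t$ steps, and compares it with $\varepsilon$ at $t = \lceil n \ln(n/\varepsilon) \rceil$. The function names are ours, and floating-point arithmetic is assumed to be accurate enough for these small cases.

```python
import math


def prob_some_coordinate_unchosen(n, t):
    """Exact probability that, after t uniform picks from n coordinates,
    at least one coordinate was never picked (inclusion-exclusion)."""
    return sum(
        (-1) ** (j + 1) * math.comb(n, j) * (1 - j / n) ** t
        for j in range(1, n + 1)
    )


def coupling_bound_steps(n, eps):
    """The step count n ln(n / eps) from the coupling argument, rounded up."""
    return math.ceil(n * math.log(n / eps))
```

The exact probability is typically somewhat smaller than $\varepsilon$, which reflects the slack in the union bound.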
11.2.3. Example: Independent Sets of Fixed Size


We consider a Markov chain whose states are all independent sets of size exactly $k$ in
a graph $G = (V, E)$. Because we restrict ourselves to independent sets of a fixed size,
we need a different Markov chain than the chain for all independent sets developed in
Section 10.4. A move is made from the independent set $X_t$ by choosing a vertex $v \in X_t$
uniformly at random and a vertex $w \in V$ uniformly at random. The move $m(v, w, X_t)$
can be described as follows: if $w \notin X_t$ and $(X_t - \{v\}) \cup \{w\}$ is an independent set,
then $X_{t+1} = (X_t - \{v\}) \cup \{w\}$; otherwise, $X_{t+1} = X_t$. Let $n$ be the number of vertices
in the graph and let $\Delta$ be the maximum degree of any vertex. We show here that this
chain is rapidly mixing whenever $k \le n/(3\Delta + 3)$. We leave the task of showing that
the Markov chain is ergodic and has a uniform stationary distribution as Exercise 11.11,
and assume this in the following argument.

We consider a coupling on $Z_t = (X_t, Y_t)$. Our coupling will require an arbitrary
perfect matching $M$ between the vertices of $X_t - Y_t$ and $Y_t - X_t$ at each step; for
example, we may label the vertices 1 to $n$ and match the elements of $X_t - Y_t$ in sorted
order via a one-to-one mapping with the elements of $Y_t - X_t$ in sorted order. For our
coupling, we first choose a transition for the chain $X_t$ by choosing $v \in X_t$ and $w \in V$
uniformly at random and then perform the move $m(v, w, X_t)$. Clearly, the copy of the
chain $X_t$ follows the original Markov chain faithfully as required by Definition 11.3.
For the transition of $Y_t$, if $v \in Y_t$ then we use the same pair of vertices $v$ and $w$ and
perform the move $m(v, w, Y_t)$; if $v \notin Y_t$, we perform the move $m(M(v), w, Y_t)$ (where
$M(v)$ denotes the vertex matched to $v$). The copy of the chain $Y_t$ also follows the
original Markov chain faithfully, since each pair of vertices with $v \in Y_t$ and $w \in V$ is chosen
with probability $1/kn$.

An alternative way of establishing the coupling is as follows. We again choose
$v \in X_t$ and $w \in V$ uniformly at random and then perform the move $m(v, w, X_t)$ in the chain
$X_t$. If $v \in Y_t$, we perform the move $m(v, w, Y_t)$ in the chain $Y_t$; otherwise, we choose
uniformly at random a vertex $v' \in Y_t - X_t$ and perform the move $m(v', w, Y_t)$ in the
chain $Y_t$. We see in Exercise 11.10 that this also satisfies Definition 11.3.
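The move $m(v, w, X_t)$ and the first (matching-based) coupling can be sketched as follows. This is an illustration only, under our own assumptions: the graph is a dict mapping each vertex to the set of its neighbors, independent sets are Python sets or frozensets, and the matching pairs the elements of the two set differences in sorted order, as suggested in the text.

```python
import random


def is_independent(vertices, adj):
    """True if no two vertices in the collection are adjacent."""
    return all(adj[u].isdisjoint(vertices) for u in vertices)


def move(v, w, X, adj):
    """The move m(v, w, X): swap v for w when the result is independent."""
    if w not in X:
        candidate = (X - {v}) | {w}
        if is_independent(candidate, adj):
            return candidate
    return X


def coupled_step(X, Y, adj, rng):
    """One step of the coupling that uses a sorted-order perfect matching."""
    v = rng.choice(sorted(X))
    w = rng.choice(sorted(adj))
    # Match X - Y with Y - X; both differences have the same size.
    matching = dict(zip(sorted(X - Y), sorted(Y - X)))
    v_for_y = v if v in Y else matching[v]
    return move(v, w, X, adj), move(v_for_y, w, Y, adj)
```

Running `coupled_step` repeatedly on a small graph is a quick way to observe that both sets keep size $k$, both stay independent, and the number of vertices in which they differ changes by at most 1 per step.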

Let $d_t = |X_t - Y_t|$ measure the difference between the two independent sets after $t$
steps. Clearly $d_t$ can change by at most 1 at each step. We show that $d_t$ is more likely
to decrease than increase, and we use this fact to establish an upper bound on the
probability that $d_t > 0$ for sufficiently large $t$.

Suppose that $d_t > 0$. In order for $d_{t+1} = d_t + 1$, it must be that at time $t$ the
vertex $v$ is chosen from $X_t \cap Y_t$, and $w$ is chosen so that there is a transition in exactly
one of the chains. Thus, $w$ must be either a vertex or a neighbor of a vertex in the set
$(X_t - Y_t) \cup (Y_t - X_t)$. It follows that

\[ \Pr(d_{t+1} = d_t + 1 \mid d_t > 0) \le \frac{k - d_t}{k} \cdot \frac{2 d_t (\Delta + 1)}{n}. \]

Similarly, in order for $d_{t+1} = d_t - 1$, it is sufficient that at time $t$ we have $v \notin Y_t$ and
$w$ neither a vertex nor a neighbor of a vertex in the set $X_t \cup Y_t - \{v, v'\}$ (where $v'$ is
the vertex of $Y_t - X_t$ used for the move in the chain $Y_t$). Note that
$|X_t \cup Y_t| = k + d_t$. Therefore,

\[ \Pr(d_{t+1} = d_t - 1 \mid d_t > 0) \ge \frac{d_t}{k} \cdot \frac{n - (k + d_t - 2)(\Delta + 1)}{n}. \]


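We thus have, for $d_t > 0$,

\begin{align*}
E[d_{t+1} \mid d_t] &= d_t + \Pr(d_{t+1} = d_t + 1) - \Pr(d_{t+1} = d_t - 1) \\
  &\le d_t + \frac{k - d_t}{k} \cdot \frac{2 d_t (\Delta + 1)}{n}
       - \frac{d_t}{k} \cdot \frac{n - (k + d_t - 2)(\Delta + 1)}{n} \\
  &= d_t \left(1 - \frac{n - (3k - d_t - 2)(\Delta + 1)}{kn}\right) \\
  &\le d_t \left(1 - \frac{n - (3k - 3)(\Delta + 1)}{kn}\right).
\end{align*}

Once $d_t = 0$, the two chains follow the same path and so $E[d_{t+1} \mid d_t = 0] = 0$.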

Using the conditional expectation equality, we have

\[ E[d_{t+1}] = E[E[d_{t+1} \mid d_t]] \le E[d_t] \left(1 - \frac{n - (3k - 3)(\Delta + 1)}{kn}\right). \]

By induction, we find that

\[ E[d_t] \le d_0 \left(1 - \frac{n - (3k - 3)(\Delta + 1)}{kn}\right)^t. \]

Since $d_0 \le k$ and since $d_t$ is a nonnegative integer, it follows that

\[ \Pr(d_t \ge 1) \le E[d_t] \le k \left(1 - \frac{n - (3k - 3)(\Delta + 1)}{kn}\right)^t \le k e^{-t(n - (3k - 3)(\Delta + 1))/kn}. \]

A consequence of this result is that the variation distance converges to zero whenever
$k \le n/(3\Delta + 3)$, and in this case

\[ \tau(\varepsilon) \le \frac{kn \ln(k\varepsilon^{-1})}{n - (3k - 3)(\Delta + 1)}. \]

We thus find that $\tau(\varepsilon)$ is polynomial in $n$ and $\ln(1/\varepsilon)$, implying that the chain is rapidly
mixing, whenever $k \le n/(3\Delta + 3)$.

We can actually improve upon this result. In Exercise 11.12 we use a slightly more
sophisticated coupling to obtain a bound that holds for any $k \le n/(2(\Delta + 1))$.


11.3. Application: Variation Distance Is Nonincreasing 


We know that an ergodic Markov chain eventually converges to its stationary
distribution. In fact, the variation distance between the state of a Markov chain and its
stationary distribution is nonincreasing in time. To show this, we start with an
interesting lemma that gives another useful property of the variation distance.

Lemma 11.3: Given distributions $\sigma_X$ and $\sigma_Y$ on a state space $S$, let $Z = (X, Y)$ be a
random variable on $S \times S$, where $X$ is distributed according to $\sigma_X$ and $Y$ is distributed
according to $\sigma_Y$. Then

\[ \Pr(X \ne Y) \ge \|\sigma_X - \sigma_Y\|. \tag{11.1} \]

Moreover, there exists a joint distribution $Z = (X, Y)$, where $X$ is distributed
according to $\sigma_X$ and $Y$ is distributed according to $\sigma_Y$, for which equality holds.


Again, examining a specific example (such as in Figure 11.1) helps us understand the 
following proof. 


Proof: For each $x \in S$, we have

\[ \Pr(X = Y = x) \le \min(\Pr(X = x), \Pr(Y = x)). \]

Hence

\[ \Pr(X = Y) \le \sum_{x \in S} \min(\Pr(X = x), \Pr(Y = x)), \]

and therefore

\begin{align*}
\Pr(X \ne Y) &\ge 1 - \sum_{x \in S} \min(\Pr(X = x), \Pr(Y = x)) \\
  &= \sum_{x \in S} \bigl(\Pr(X = x) - \min(\Pr(X = x), \Pr(Y = x))\bigr).
\end{align*}

Hence we are done if we can show

\[ \|\sigma_X - \sigma_Y\| = \sum_{x \in S} \bigl(\Pr(X = x) - \min(\Pr(X = x), \Pr(Y = x))\bigr). \tag{11.2} \]

But $\Pr(X = x) - \min(\Pr(X = x), \Pr(Y = x)) = 0$ when $\sigma_X(x) < \sigma_Y(x)$, and when
$\sigma_X(x) \ge \sigma_Y(x)$ it is

\[ \Pr(X = x) - \Pr(Y = x) = \sigma_X(x) - \sigma_Y(x). \]

If we let $S^+$ be the set of all states for which $\sigma_X(x) \ge \sigma_Y(x)$, then the right-hand side
of Eqn. (11.2) is equal to $\sigma_X(S^+) - \sigma_Y(S^+)$, which equals $\|\sigma_X - \sigma_Y\|$ from the
argument in Lemma 11.1. This gives the first part of the lemma.

Equality holds in Eqn. (11.1) if we take a joint distribution where $X = Y$ as much
as possible. Specifically, let $m(x) = \min(\Pr(X = x), \Pr(Y = x))$. If $\sum_x m(x) = 1$,
then $X$ and $Y$ have the same distribution and we are done. Otherwise, let $Z = (X, Y)$
be defined by

\[ \Pr(X = x, Y = y) =
   \begin{cases}
     m(x) & \text{if } x = y; \\[4pt]
     \dfrac{(\sigma_X(x) - m(x))(\sigma_Y(y) - m(y))}{1 - \sum_z m(z)} & \text{otherwise.}
   \end{cases} \]

The idea behind this choice of $Z$ is to first match $X$ and $Y$ as much as possible and then
force $X$ and $Y$ to behave independently if they do not match.

For this choice of $Z$,

\[ \Pr(X = Y) = \sum_x m(x) = 1 - \|\sigma_X - \sigma_Y\|. \]




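It remains to show that, for this choice of $Z$, $\Pr(X = x) = \sigma_X(x)$; the same
argument will hold for $\Pr(Y = y)$. If $m(x) = \sigma_X(x)$ then $\Pr(X = x, Y = x) = m(x)$ and
$\Pr(X = x, Y = y) = 0$ when $x \ne y$, so $\Pr(X = x) = \sigma_X(x)$. If $m(x) = \sigma_Y(x)$, then

\begin{align*}
\Pr(X = x) &= \sum_y \Pr(X = x, Y = y) \\
  &= m(x) + \sum_{y \ne x} \frac{(\sigma_X(x) - m(x))(\sigma_Y(y) - m(y))}{1 - \sum_z m(z)} \\
  &= m(x) + \frac{(\sigma_X(x) - m(x)) \sum_{y \ne x} (\sigma_Y(y) - m(y))}{1 - \sum_z m(z)} \\
  &= m(x) + \frac{(\sigma_X(x) - m(x)) \bigl(1 - \sigma_Y(x) - (\sum_z m(z) - m(x))\bigr)}{1 - \sum_z m(z)} \\
  &= m(x) + (\sigma_X(x) - m(x)) \\
  &= \sigma_X(x),
\end{align*}

completing the proof.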
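The joint distribution constructed in this proof is easy to build explicitly for small state spaces. The sketch below is a hedged illustration with names of our choosing: distributions are dicts from states to exact fractions, and the function returns the joint probabilities $\Pr(X = x, Y = y)$ following the two-case formula for $Z$.

```python
from fractions import Fraction


def maximal_coupling(sigma_x, sigma_y):
    """Joint distribution on pairs (x, y) that matches X and Y as often as
    possible and makes them independent otherwise (Lemma 11.3)."""
    states = sorted(set(sigma_x) | set(sigma_y))
    px = {s: sigma_x.get(s, 0) for s in states}
    py = {s: sigma_y.get(s, 0) for s in states}
    m = {s: min(px[s], py[s]) for s in states}
    total = sum(m.values())
    joint = {}
    for x in states:
        for y in states:
            if x == y:
                joint[(x, y)] = m[x]
            elif total < 1:
                joint[(x, y)] = (px[x] - m[x]) * (py[y] - m[y]) / (1 - total)
            else:
                joint[(x, y)] = 0  # identical distributions: always X = Y
    return joint


# Hypothetical three-state example.
sigma_x = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(0)}
sigma_y = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}
joint = maximal_coupling(sigma_x, sigma_y)
```

In this example the variation distance is 1/2, and summing the off-diagonal entries of `joint` gives the same value, so equality holds in Eqn. (11.1).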


Recall that $\Delta(t) = \max_x \Delta_x(t)$, where $\Delta_x(t)$ is the variation distance between the
stationary distribution and the distribution of the state of the Markov chain after $t$ steps
when starting at state $x$. Using Lemma 11.3, we can prove that $\Delta(t)$ is nonincreasing
over time.


Theorem 11.4: For any ergodic Markov chain $M_t$, $\Delta(T + 1) \le \Delta(T)$.

Proof: Let $x$ be any given state, and let $y$ be a state chosen from the stationary
distribution. Then

\[ \Delta_x(T) = \|p_x^T - p_y^T\|. \]

Indeed, if $X_T$ is distributed according to $p_x^T$ and if $Y_T$ is distributed according to $p_y^T$, then
by Lemma 11.3 there exists a random variable $Z_T = (X_T, Y_T)$ with $\Pr(X_T \ne Y_T) =
\Delta_x(T)$. From this state $Z_T$, consider any one-step coupling for the Markov chain that
takes $Z_T = (X_T, Y_T)$ to $Z_{T+1} = (X_{T+1}, Y_{T+1})$ in such a way that, whenever
$X_T = Y_T$, the coupling makes the same move, so that $X_{T+1} = Y_{T+1}$. Now $X_{T+1}$ is
distributed according to $p_x^{T+1}$ and $Y_{T+1}$ is distributed according to $p_y^{T+1}$, which is the
stationary distribution. Hence, by Lemma 11.3,

\begin{align*}
\Delta_x(T) &= \Pr(X_T \ne Y_T) \\
  &\ge \Pr(X_{T+1} \ne Y_{T+1}) \\
  &\ge \|p_x^{T+1} - p_y^{T+1}\| \\
  &= \Delta_x(T + 1).
\end{align*}

The second line follows from the first because the one-step coupling assures
$X_{T+1} = Y_{T+1}$ whenever $X_T = Y_T$. The result follows since the foregoing relations hold for
every state $x$. ■
11.4. Geometric Convergence


The following general result, derived from a trivial coupling, is useful for bounding the
mixing time of some Markov chains.

Theorem 11.5: Let $P$ be the transition matrix for a finite, irreducible, aperiodic
Markov chain. Let $m_j$ be the smallest entry in the $j$th column of the matrix, and let
$m = \sum_j m_j$. Then, for all $x$ and $t$,

\[ \|p_x^t - \pi\| \le (1 - m)^t. \]

Proof: If the minimum entry in column $j$ is $m_j$, then in one step the chain reaches state
$j$ with probability at least $m_j$ from every state. Hence we can design a coupling where
the two copies of the chain both move to state $j$ together with probability at least $m_j$ in
every step. Since this holds for all $j$, at each step the two chains can be made to couple
with probability at least $m$. Hence the probability they have not coupled after $t$ steps
is at most $(1 - m)^t$, yielding the theorem via the coupling lemma. ■

Theorem 11.5 is not immediately helpful if there is a zero entry in each column, in
which case $m = 0$. In Exercise 11.6, we consider how to make it useful for any finite,
irreducible, aperiodic Markov chain. Theorem 11.5 shows that, under very general
conditions, Markov chains converge quickly to their stationary distributions, with the
variation distance converging geometrically in the number of steps.
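The bound of Theorem 11.5 can be checked on a small example. The sketch below is an illustration under our own assumptions: the transition matrix is a list of rows of exact fractions, and the particular 3-state matrix is a made-up doubly stochastic example, so its stationary distribution is uniform.

```python
from fractions import Fraction


def column_min_mass(P):
    """The quantity m: the sum over columns of the smallest entry in each."""
    n = len(P)
    return sum(min(P[i][j] for i in range(n)) for j in range(n))


def distribution_after(P, x, t):
    """Distribution of the chain after t steps when started at state x."""
    n = len(P)
    dist = [1 if j == x else 0 for j in range(n)]
    for _ in range(t):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist


def variation_distance(p, q):
    """Half the L1 distance between two distributions given as lists."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2


half, quarter = Fraction(1, 2), Fraction(1, 4)
P = [[half, quarter, quarter],
     [quarter, half, quarter],
     [quarter, quarter, half]]
pi = [Fraction(1, 3)] * 3   # uniform, since P is doubly stochastic
m = column_min_mass(P)      # each column minimum is 1/4
```

Here $m = 3/4$, so the theorem promises a distance of at most $(1/4)^t$ after $t$ steps from any starting state.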

A more general related result is the following. Suppose that we can obtain an upper
bound on $\tau(c)$ for some constant $c < 1/2$. For example, such a bound might be found
by a coupling. This is sufficient to bootstrap a bound for $\tau(\varepsilon)$ for any $\varepsilon > 0$.

Theorem 11.6: Let $P$ be the transition matrix for a finite, irreducible, aperiodic
Markov chain $M_t$ with $\tau(c) \le T$ for some $c < 1/2$. Then, for this Markov chain,
$\tau(\varepsilon) \le \lceil \ln \varepsilon / \ln(2c) \rceil T$.


Proof: Consider any two initial states $X_0 = x$ and $Y_0 = y$. By the definition of $\tau(c)$,
we have $\|p_x^T - \pi\| \le c$ and $\|p_y^T - \pi\| \le c$. It follows that $\|p_x^T - p_y^T\| \le 2c$ and hence,
by Lemma 11.3, there exists a random variable $Z_{T,x,y} = (X_T, Y_T)$ with $X_T$ distributed
according to $p_x^T$ and $Y_T$ distributed according to $p_y^T$ such that $\Pr(X_T \ne Y_T) \le 2c$.

Now consider the Markov chain $M'_t$ given by the transition matrix $P^T$, which
corresponds to a chain that takes $T$ steps of $M_t$ for each of its steps; the $Z_{T,x,y}$ give a
coupling for this new chain. That is, given two copies of the chain $M'_t$ in the paired
state $(x, y)$, we can let the next paired state be given by the distribution $Z_{T,x,y}$, which
guarantees that the probability the two states have not coupled in one step is at most $2c$.
The probability that this coupling of the chain $M'_t$ has not coupled over $k$ steps is then
at most $(2c)^k$ by induction. By the coupling lemma, $M'_t$ is within variation distance $\varepsilon$
of its stationary distribution after $k$ steps if

\[ (2c)^k \le \varepsilon. \]

It follows that, after at most $\lceil \ln \varepsilon / \ln(2c) \rceil$ steps, $M'_t$ is within variation distance $\varepsilon$ of
its stationary distribution. But $M'_t$ and $M_t$ have the same stationary distribution, and
each step of $M'_t$ corresponds to $T$ steps of $M_t$. Therefore,

\[ \tau(\varepsilon) \le \left\lceil \frac{\ln \varepsilon}{\ln(2c)} \right\rceil T \]

for the Markov chain $M_t$. ■
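As a small worked example of the bootstrapping in Theorem 11.6, the following sketch evaluates the bound $\lceil \ln \varepsilon / \ln(2c) \rceil T$ for given values. The function name and the sample numbers are ours, chosen only for illustration.

```python
import math


def bootstrapped_mixing_bound(T, c, eps):
    """Upper bound on tau(eps) given that tau(c) <= T for some c < 1/2."""
    if not 0 < c < 0.5:
        raise ValueError("c must lie strictly between 0 and 1/2")
    if not 0 < eps < 1:
        raise ValueError("eps must lie strictly between 0 and 1")
    # Number of blocks of T steps needed so that (2c)^k <= eps.
    k = math.ceil(math.log(eps) / math.log(2 * c))
    return k * T
```

For instance, with $T = 100$, $c = 1/4$, and $\varepsilon = 0.01$, seven blocks are needed because $(1/2)^7 \le 0.01 < (1/2)^6$, giving a bound of 700 steps.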


11.5. Application: Approximately Sampling Proper Colorings 


A vertex coloring of a graph gives each vertex $v$ a color from a set $C$, which we can
assume without loss of generality is the set $\{1, 2, \ldots, c\}$. In a proper coloring, the two
endpoints of every edge are colored by two different colors. Any graph with maximum
degree $\Delta$ can be colored properly with $\Delta + 1$ colors by the following procedure: choose
an arbitrary ordering of the vertices, and color them one at a time, labeling each vertex
with a color not already used by any of its neighbors.
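The greedy procedure just described can be sketched as follows, assuming the graph is given as a dict from each vertex to a collection of its neighbors (a representation we chose for illustration). Picking the smallest free color is one convenient rule; any unused color would do.

```python
def greedy_coloring(adj):
    """Properly color a graph using colors 1, 2, ..., at most max degree + 1."""
    coloring = {}
    for v in adj:  # any ordering of the vertices works
        used = {coloring[u] for u in adj[v] if u in coloring}
        color = 1
        while color in used:
            color += 1
        coloring[v] = color
    return coloring
```

Since a vertex has at most $\Delta$ neighbors, at most $\Delta$ colors are blocked when it is colored, so the loop never needs a color larger than $\Delta + 1$.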

Here we are interested in sampling almost uniformly at random a proper coloring
of a graph. We present a Markov chain Monte Carlo (MCMC) process that generates
such a sample and then use a coupling technique to show that it is rapidly mixing. In
the terminology of Chapter 10, this gives an FPAUS for proper colorings. Applying the
general reduction from approximate counting to almost uniform sampling, as in
Theorem 10.5, we can use the FPAUS for sampling proper colorings to obtain an FPRAS
for the number of proper colorings. The details of this reduction are left as part of
Exercise 11.15.

To begin, we present a straightforward coupling that allows us to approximately
sample colorings efficiently when there are $c \ge 4\Delta + 1$ colors. We then show how to
improve the coupling to reduce the number of colors necessary to $2\Delta + 1$.

Our Markov chain on proper colorings is the simplest one possible. At each step,
choose a vertex $v$ uniformly at random and a color $\ell$ uniformly at random. Recolor
vertex $v$ with color $\ell$ if the new coloring is proper (that is, $v$ does not have a neighbor
colored $\ell$), and otherwise let the state of the chain be unchanged. This finite Markov
chain is aperiodic because it has nonzero probability of staying in the same state. When
$c \ge \Delta + 2$, it is also irreducible. To see how from any state $X$ we can reach any other
state $Y$, consider an arbitrary ordering of the vertices. Recolor the vertices in $X$ to match
$Y$ in this order. If there is a conflict at any step, it must arise because a vertex $v$ that
needs to be colored is blocked by some other vertex $v'$ later in the ordering. But $v'$
can be recolored to some other nonconflicting color, since $c \ge \Delta + 2$, allowing the
process to continue. Hence, when $c \ge \Delta + 2$, the Markov chain has a stationary
distribution. The fact that this stationary distribution is uniform over all proper colorings
can be verified by applying Lemma 10.7.

When there are $4\Delta + 1$ colors, we use a trivial coupling on the pair of chains $(X_t, Y_t)$:
choose the same vertex and color on both chains at each step.
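One step of the coloring chain, and of the trivial coupling, can be sketched as below. This is an illustration only: colorings are dicts from vertices to colors in $\{1, \ldots, c\}$, the graph is a dict of neighbor lists, and the helper names are ours.

```python
import random


def is_proper(coloring, adj):
    """True if no edge has both endpoints with the same color."""
    return all(coloring[u] != coloring[v] for u in adj for v in adj[u])


def recolor(coloring, v, color, adj):
    """Attempt to recolor v; the state is unchanged if a neighbor blocks it."""
    if any(coloring[u] == color for u in adj[v]):
        return coloring
    updated = dict(coloring)
    updated[v] = color
    return updated


def coupled_step(x, y, adj, num_colors, rng):
    """Trivial coupling: the same vertex and the same color in both chains."""
    v = rng.choice(sorted(adj))
    color = rng.randrange(1, num_colors + 1)
    return recolor(x, v, color, adj), recolor(y, v, color, adj)
```

Because both chains use the same random choices, once the two colorings agree they remain identical from then on.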
Theorem 11.7: For any graph with $n$ vertices and maximum degree $\Delta$, the mixing time
of the graph-coloring Markov chain satisfies

\[ \tau(\varepsilon) \le \left\lceil \frac{nc}{c - 4\Delta} \ln\left(\frac{n}{\varepsilon}\right) \right\rceil, \]

provided that $c \ge 4\Delta + 1$.


Proof: Let $D_t$ be the set of vertices that have different colors in the two chains at time
$t$, and let $d_t = |D_t|$. At each step in which $d_t > 0$, either $d_t$ remains at the same value
or $d_t$ increases or decreases by at most 1. We show that $d_t$ is actually more likely to
decrease than increase; then we use this fact to bound the probability that $d_t$ is nonzero
for sufficiently large $t$.

Consider any vertex $v$ that is colored differently in the two chains. Since the degree
of $v$ is at most $\Delta$, there are at least $c - 2\Delta$ colors that do not appear on the neighbors
of $v$ in either of the two chains. If the vertex is recolored to one of these $c - 2\Delta$ colors,
it will have the same color in both chains. Hence

\[ \Pr(d_{t+1} = d_t - 1 \mid d_t > 0) \ge \frac{d_t}{n} \cdot \frac{c - 2\Delta}{c}. \]

Now consider any vertex $v$ that is colored the same in both chains. For $v$ to be
colored differently at the next step, it must have some neighbor $w$ that is differently colored
in the two chains; in that case, it is possible that trying to recolor $v$ using a color that
the neighbor $w$ has in one of the two chains will recolor the vertex $v$ in one chain but
not the other. Every vertex colored differently in the two chains can affect at most $\Delta$
neighbors in this way. Hence, when $d_t > 0$,

\[ \Pr(d_{t+1} = d_t + 1 \mid d_t > 0) \le \frac{\Delta d_t}{n} \cdot \frac{2}{c}. \]

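We find that

\begin{align*}
E[d_{t+1} \mid d_t] &= d_t + \Pr(d_{t+1} = d_t + 1) - \Pr(d_{t+1} = d_t - 1) \\
  &\le d_t + \frac{\Delta d_t}{n} \cdot \frac{2}{c} - \frac{d_t}{n} \cdot \frac{c - 2\Delta}{c} \\
  &\le d_t \left(1 - \frac{c - 4\Delta}{nc}\right),
\end{align*}

which also holds if $d_t = 0$.

Using the conditional expectation equality, we have

\[ E[d_{t+1}] = E[E[d_{t+1} \mid d_t]] \le E[d_t] \left(1 - \frac{c - 4\Delta}{nc}\right). \]

By induction, we find

\[ E[d_t] \le d_0 \left(1 - \frac{c - 4\Delta}{nc}\right)^t. \]

Since $d_0 \le n$ and since $d_t$ is a nonnegative integer, it follows that

\[ \Pr(d_t \ge 1) \le E[d_t] \le n \left(1 - \frac{c - 4\Delta}{nc}\right)^t \le n e^{-t(c - 4\Delta)/nc}. \]

Hence the variation distance is at most $\varepsilon$ after

\[ t = \frac{nc}{c - 4\Delta} \ln\left(\frac{n}{\varepsilon}\right) \]

steps. ■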


Assuming that each step of the Markov chain can be accomplished efficiently in time
that is polynomial in $n$, Theorem 11.7 gives an FPAUS for proper colorings.

Theorem 11.7 is rather wasteful. For example, when bounding the probability that
$d_t$ decreases, we used the loose bound $c - 2\Delta$. The number of colors that decrease
$d_t$ could be much higher if some of the vertices around $v$ have the same color in both
chains. By being a bit more careful and slightly more clever with the coupling, we can
improve Theorem 11.7 to hold for any $c \ge 2\Delta + 1$.


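Theorem 11.8: Given an $n$-vertex graph with maximum degree $\Delta$, the mixing time of
the graph-coloring Markov chain satisfies

\[ \tau(\varepsilon) \le \left\lceil \frac{nc}{c - 2\Delta} \ln\left(\frac{n}{\varepsilon}\right) \right\rceil, \]

provided that $c \ge 2\Delta + 1$.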


Proof: As before, let $D_t$ be the set of vertices that have different colors in the two
chains at time $t$, with $|D_t| = d_t$. Let $A_t$ be the set of vertices that have the same color
in the two chains at time $t$. For a vertex $v$ in $A_t$, let $d'(v)$ be the number of vertices
adjacent to $v$ that are in $D_t$; similarly, for a vertex $w$ in $D_t$, let $d'(w)$ be the number of
vertices adjacent to $w$ that are in $A_t$. Note that

\[ \sum_{v \in A_t} d'(v) = \sum_{w \in D_t} d'(w), \]

since the two sums both count the number of edges connecting vertices in $A_t$ to vertices
in $D_t$. Denote this summation by $m'$.

Consider the following coupling: if a vertex $v \in D_t$ is chosen to be recolored, we
simply choose the same color in both chains. That is, when $v$ is in $D_t$, we are using
the same coupling we used before. The vertex $v$ will have the same color whenever the
color chosen is different from any color on any of the neighbors of $v$ in both copies of
the chain. There are at least $c - 2\Delta + d'(v)$ such colors, because each of the $d'(v)$
neighbors of $v$ in $A_t$ blocks only one color; notice that this is a tighter bound than
we used in the proof of Theorem 11.7. Hence the probability that $d_{t+1} = d_t - 1$ when
$d_t > 0$ is at least

\[ \frac{1}{n} \sum_{v \in D_t} \frac{c - 2\Delta + d'(v)}{c} = \frac{1}{cn} \bigl((c - 2\Delta) d_t + m'\bigr). \]


Assume now that the vertex to be recolored is $v \in A_t$. In this case we change the
coupling slightly. Recall that, in the previous coupling, recoloring a vertex $v \in A_t$
results in $v$ becoming differently colored in the two chains if the randomly chosen color
appears on a neighbor of $v$ in one chain but not the other. For example: if $v$ is colored
green, and a neighbor $w$ is colored red in one chain and blue in the other, and no other
neighbor of $v$ is colored red or blue in either chain, then attempting to color $v$ either
red or blue will cause $v$ to be recolored in one chain but not the other. Hence there are
two potential choices for $v$'s color that increase $d_t$.

Figure 11.2: (a) original coupling; (b) improved coupling. In the original coupling of part (a),
the gray vertex has the same color in both chains and has a neighbor with different colors in the two
chains, one black and one white. If an attempt is made to recolor the gray vertex black, then the move
will succeed in one chain but not the other, increasing $d_t$. Similarly, if an attempt is made to recolor
the gray vertex white, then the move will succeed in one chain but not the other, giving a second move
that increases $d_t$. In the improved coupling of part (b), if the gray vertex is recolored white in $X_t$ then
the gray vertex is recolored black in $Y_t$ and vice versa, giving just one move that increases $d_t$.

In this specific case where just one vertex $w$ neighboring $v$ has different colors in
the two chains, we could improve the coupling as follows: when we try to recolor $v$
blue in the first chain, we try to recolor it red in the second chain; and when we try to
recolor it red in the first chain, we try to recolor it blue in the second chain. Now $v$
either changes color in both chains or stays the same in both chains. By changing the
coupling, we have collapsed two potentially bad moves that increase $d_t$ into just one
bad move. See Figure 11.2 for an example.

More generally, if there are $d'(v)$ differently colored vertices around $v$, then we can
couple the colors so that at most $d'(v)$ color choices cause $d_t$ to increase, instead of up
to $2d'(v)$ choices in the original coupling. Concretely, let $S_1(v)$ be the set of colors on
neighbors of $v$ in the first chain but not the second, and similarly let $S_2(v)$ be the set
of colors on neighbors of $v$ in the second chain but not the first. Couple pairs of colors
$c_1 \in S_1(v)$ and $c_2 \in S_2(v)$ as much as possible, so that when $c_1$ is chosen in one chain
$c_2$ is chosen in the other. Then the total number of ways to color $v$ that increases $d_t$ is
at most $\max(|S_1(v)|, |S_2(v)|) \le d'(v)$.
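The pairing of colors can be written out explicitly. The sketch below is a hedged illustration with names of our choosing: it builds a map from the color tried in the first chain to the color tried in the second, swapping paired colors from $S_1(v)$ and $S_2(v)$ and leaving all other colors fixed, and then counts how many color choices leave $v$ colored differently in the two chains. Colors are the integers $1, \ldots, c$ and colorings are dicts.

```python
def paired_color_map(v, x, y, adj, num_colors):
    """Color to try in the second chain for each color tried in the first."""
    colors_x = {x[u] for u in adj[v]}
    colors_y = {y[u] for u in adj[v]}
    s1 = sorted(colors_x - colors_y)  # on neighbors in chain 1 only
    s2 = sorted(colors_y - colors_x)  # on neighbors in chain 2 only
    mapping = {c: c for c in range(1, num_colors + 1)}
    for c1, c2 in zip(s1, s2):
        mapping[c1] = c2
        mapping[c2] = c1
    return mapping


def bad_color_count(v, x, y, adj, mapping):
    """Number of color choices after which v differs between the chains."""
    colors_x = {x[u] for u in adj[v]}
    colors_y = {y[u] for u in adj[v]}
    bad = 0
    for cx, cy in mapping.items():
        new_x = cx if cx not in colors_x else x[v]
        new_y = cy if cy not in colors_y else y[v]
        bad += new_x != new_y
    return bad


# The example from the text: v = 0 is green (3); its neighbor 1 is red (1)
# in one chain and blue (2) in the other; neighbor 2 has color 4 in both.
adj = {0: [1, 2], 1: [0], 2: [0]}
x = {0: 3, 1: 1, 2: 4}
y = {0: 3, 1: 2, 2: 4}
identity = {c: c for c in range(1, 6)}
paired = paired_color_map(0, x, y, adj, 5)
```

Because $S_1(v)$ and $S_2(v)$ are disjoint, the map is a bijection on colors, so the color tried in the second chain is still uniform. In the example, the identity map has two bad color choices and the paired map has one.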

As a result, the probability that $d_{t+1} = d_t + 1$ when $d_t > 0$ is at most

\[ \frac{1}{n} \sum_{v \in A_t} \frac{d'(v)}{c} = \frac{m'}{cn}. \]

We therefore find that

\[ E[d_{t+1} \mid d_t] \le d_t \left(1 - \frac{c - 2\Delta}{nc}\right). \]

Following the same reasoning as in the proof of Theorem 11.7, we have

\[ \Pr(d_t \ge 1) \le E[d_t] \le n \left(1 - \frac{c - 2\Delta}{nc}\right)^t \le n e^{-t(c - 2\Delta)/nc}, \]

and the variation distance is at most $\varepsilon$ after

\[ t = \frac{nc}{c - 2\Delta} \ln\left(\frac{n}{\varepsilon}\right) \]

steps. ■


Hence we can use the Markov chain for proper colorings to give us an FPAUS
whenever $c > 2\Delta$.
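To get a feel for the difference between the two analyses, the step counts from the two proofs, $\frac{nc}{c - 4\Delta} \ln(n/\varepsilon)$ and $\frac{nc}{c - 2\Delta} \ln(n/\varepsilon)$, can be evaluated side by side. The helper below is a small sketch; its name, its `improved` flag, and the sample values are our own.

```python
import math


def coloring_mixing_bound(n, c, max_degree, eps, improved=False):
    """Steps after which the variation distance is at most eps, using the
    trivial coupling (improved=False) or the paired-color coupling."""
    gap = c - (2 if improved else 4) * max_degree
    if gap <= 0:
        raise ValueError("too few colors for this bound to apply")
    return math.ceil(n * c / gap * math.log(n / eps))
```

For $n = 100$, $\Delta = 3$, $c = 13$, and $\varepsilon = 0.01$, the first analysis gives 11974 steps and the second gives 1711; with $c = 12$ colors only the second bound applies at all.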


11.6. Path Coupling 


In Section 10.3 we showed that, if we can obtain an FPAUS for independent sets for
graphs of degree at most $\Delta$, then we can approximately count the number of
independent sets in such graphs. Here we present a Markov chain on independent sets, together
with a coupling argument, to prove that the chain gives such an FPAUS when $\Delta \le 4$.
The coupling argument uses a further technique, path coupling. We demonstrate this
technique specifically for the Markov chain sampling independent sets in a graph,
although with appropriate definitions the approach can be generalized to other problems.

Interestingly, it is very difficult to prove that the simple Markov chain for sampling
independent sets given in Section 10.4, which removes or attempts to add a random
vertex to the current independent set at each step, mixes quickly. Instead, we consider
here a different Markov chain that simplifies the analysis. We assume without loss of
generality that the graph consists of a single connected component. At each step, the
Markov chain chooses an edge $(u, v)$ in the graph uniformly at random. If $X_t$ is the
independent set at time $t$, then the move proceeds as follows.

• With probability 1/3, set $X_{t+1} = X_t - \{u, v\}$. (This move removes $u$ and $v$, if they
are in the set.)

• With probability 1/3, let $Y = (X_t - \{u\}) \cup \{v\}$. If $Y$ is an independent set, then
$X_{t+1} = Y$; otherwise, $X_{t+1} = X_t$. (This move tries to remove $u$ if it is in the set and
then add $v$.)

• With probability 1/3, let $Y = (X_t - \{v\}) \cup \{u\}$. If $Y$ is an independent set, then
$X_{t+1} = Y$; otherwise, $X_{t+1} = X_t$. (This move tries to remove $v$ if it is in the set and
then add $u$.)

It is easy to verify that the chain has a stationary distribution that is uniform on all
independent sets. We now use the path coupling argument to bound the mixing time of
the chain.
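The three moves of this chain can be sketched as follows. This is a minimal illustration under our own assumptions: the graph is given both as a list of edges and as a dict of neighbor sets, independent sets are frozensets, and the integer `choice` in $\{0, 1, 2\}$ selects which of the three moves is applied.

```python
import random


def is_independent(vertices, adj):
    """True if no two vertices in the collection are adjacent."""
    return all(adj[u].isdisjoint(vertices) for u in vertices)


def edge_move(X, u, v, choice, adj):
    """Apply one of the three moves for the chosen edge (u, v)."""
    if choice == 0:
        return X - {u, v}             # remove both endpoints
    if choice == 1:
        candidate = (X - {u}) | {v}   # try to replace u by v
    else:
        candidate = (X - {v}) | {u}   # try to replace v by u
    return candidate if is_independent(candidate, adj) else X


def chain_step(X, edges, adj, rng):
    """One step: a uniform random edge and a uniform choice of move."""
    u, v = rng.choice(edges)
    return edge_move(X, u, v, rng.randrange(3), adj)
```

Starting from the empty set (always an independent set) and calling `chain_step` repeatedly walks over the independent sets of the graph.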

Figure 11.3: Three cases for the independent set Markov chain. Vertices colored black are in both
independent sets of the coupling. Vertex $x$ is colored gray, to represent that it is a member of the
independent set of one chain in the coupling but not the other.

The idea of path coupling is to start with a coupling for pairs of states $(X_t, Y_t)$ that
differ in just one vertex. This coupling is then extended to a general coupling over
all pairs of states. When it applies, path coupling is very powerful, because it is often
much easier to analyze the situation where the two states differ in a small way (here,
in just one vertex) than to analyze all possible pairs of states.

Consider a graph $G = (V, E)$. We say that a vertex is bad if it is an element of $X_t$ or
$Y_t$ but not both; otherwise, the vertex is good. Let $d_t = |X_t - Y_t| + |Y_t - X_t|$, so that $d_t$
counts the number of bad vertices. Assume that $X_t$ and $Y_t$ differ in exactly one vertex
(i.e., $d_t = 1$). We apply a simple coupling, performing the same move in both states,
and show that under this coupling $\mathbf{E}[d_{t+1} \mid d_t] \le d_t$ when $d_t = 1$ or, equivalently, that
$\mathbf{E}[d_{t+1} - d_t \mid d_t = 1] \le 0$.

Without loss of generality, let $X_t = I$ and $Y_t = I \cup \{x\}$. A change in $d_t$ can occur
only when a move involves a neighbor of $x$. Thus, in analyzing this coupling, we
can restrict our discussion to moves in which the chosen random edge is adjacent to a
neighbor of $x$. Let $\delta_w = 1$ if the vertex $w \ne x$ goes from good to bad between step $t$ and
step $t + 1$. Similarly, let $\delta_x = -1$ if the vertex $x$ goes from bad to good between step $t$
and step $t + 1$. By linearity of expectations,

$$\mathbf{E}[d_{t+1} - d_t \mid d_t = 1] = \mathbf{E}\Big[\sum_w \delta_w \;\Big|\; d_t = 1\Big] = \sum_w \mathbf{E}[\delta_w \mid d_t = 1].$$

As we shall see, in the summation we need only consider those $w$ that are equal to $x$,
a neighbor of $x$, or a neighbor of a neighbor of $x$, since these are the only vertices that
can change from good to bad or bad to good in one step of the chain. We shall demonstrate
how to balance the moves in such a way that it becomes clear that
$\mathbf{E}[d_{t+1} - d_t \mid d_t = 1] \le 0$ as long as $\Delta \le 4$.

Assume that $x$ has $k$ neighbors, and let $y$ be one of these neighbors. For each vertex
$y$ that is a neighbor of $x$, we consider all of the moves that choose an edge adjacent to
$y$. The subsequent analysis makes use of the restriction $\Delta \le 4$. There are three cases,
as shown in Figure 11.3.

1. Suppose that $y$ has two or more neighbors in the independent set $I = X_t$. Then no
move that involves $y$ can increase the number of bad vertices, and hence $d_{t+1}$ cannot
be larger than $d_t$ as a result of any such move.

2. Suppose that $y$ has no neighbors in $I$. Then $d_t$ can increase by 1 if the edge $(y, z_i)$
(where $1 \le i \le 3$, the $z_i$ being the neighbors of $y$ other than $x$, of which there are at
most three because $\Delta \le 4$) is chosen and an attempt is made to add $y$ and remove $z_i$. These
moves are successful on $X_t$ but not on $Y_t$, and hence $\delta_y = 1$ with probability at most
$3 \cdot 1/(3|E|) = 1/|E|$. No other move involving $y$ increases $d_t$.

The possible gain from $\delta_y$ is balanced by moves that decrease $\delta_x$. Any of the
three possible moves on the edge $(x, y)$ match the vertex $x$, so that $\delta_x = -1$, and no
other bad vertices are created. Hence $\delta_x = -1$ with probability at least $1/|E|$. We
see that the total effect of all of these moves on $\sum_w \mathbf{E}[\delta_w \mid d_t = 1]$ is at most

$$1 \cdot \frac{1}{|E|} - \frac{1}{|E|} = 0,$$

so that the moves from this case do not increase $\mathbf{E}[d_{t+1} - d_t \mid d_t = 1]$.

3. Suppose that $y$ has one neighbor in $I$. If the edge $(x, y)$ is chosen, then two moves
can give $\delta_x = -1$: the move that removes both $x$ and $y$, or the move that removes $y$
and adds $x$. The third move, which tries to add $y$ and remove $x$, fails in both chains
because $y$ has a neighbor in $I$. Hence $\delta_x = -1$ with probability at least $\frac{2}{3}(1/|E|)$.

Let $z$ be the neighbor of $y$ in $I$. Both $y$ and $z$ can become bad in one step if the
edge $(y, z)$ is chosen and an attempt is made to add $y$ and remove $z$. This move
is successful on $X_t$ but not on $Y_t$, causing $d_t$ to increase by 2 since $\delta_y$ and $\delta_z$ both
equal 1. No other move increases $d_t$. Hence the probability that the number of bad
vertices is increased in this case is $1/(3|E|)$, and the increase is by 2. Again, the total
effect of all of these moves on $\sum_w \mathbf{E}[\delta_w \mid d_t = 1]$ is at most

$$2 \cdot \frac{1}{3|E|} - \frac{2}{3} \cdot \frac{1}{|E|} = 0,$$

so that the moves from this case do not increase $\mathbf{E}[d_{t+1} - d_t \mid d_t = 1]$.
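
The case analysis can be checked exhaustively on a small graph. The sketch below is an illustration only: the $3 \times 3$ grid (maximum degree 4), the helper names, and the use of exact rational arithmetic are our own choices. It enumerates every pair $(I, I \cup \{x\})$ of independent sets and every (edge, move) choice, applies the same move to both states, and computes $\mathbf{E}[d_{t+1} - d_t \mid d_t = 1]$ exactly.

```python
from fractions import Fraction
from itertools import combinations

def is_independent(graph, s):
    return all(w not in s for v in s for w in graph[v])

def apply_move(graph, s, edge, move):
    """Apply one of the three moves of the chain on the given edge to the set s."""
    u, v = edge
    if move == 0:
        return s - {u, v}
    cand = (s - {u}) | {v} if move == 1 else (s - {v}) | {u}
    return cand if is_independent(graph, cand) else s

def expected_change(graph, edges, base, x):
    """Exact E[d_{t+1} - d_t] for X_t = base, Y_t = base + {x}, same move in both states."""
    xs, ys = frozenset(base), frozenset(base) | {x}
    total = Fraction(0)
    for edge in edges:
        for move in range(3):
            nx = apply_move(graph, xs, edge, move)
            ny = apply_move(graph, ys, edge, move)
            # len(nx ^ ny) is the number of bad vertices after the move; d_t was 1.
            total += Fraction(len(nx ^ ny) - 1, 3 * len(edges))
    return total

def grid_graph(rows, cols):
    g = {(r, c): set() for r in range(rows) for c in range(cols)}
    for (r, c) in g:
        for nb in ((r + 1, c), (r, c + 1)):
            if nb in g:
                g[(r, c)].add(nb)
                g[nb].add((r, c))
    return g

graph = grid_graph(3, 3)
edges = [(u, v) for u in graph for v in graph[u] if u < v]
worst = max(
    expected_change(graph, edges, base, x)
    for k in range(len(graph) + 1)
    for base in combinations(graph, k)
    for x in graph
    if x not in base and is_independent(graph, set(base) | {x})
)
print(worst)   # largest expected change over all pairs that differ in one vertex
```

Consistent with the argument above, the largest expected change found on this graph should not be positive.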


The case analysis shows that if we consider moves that involve a specific neighbor
$y$, they balance so that every move that increases $d_{t+1} - d_t$ is matched by corresponding
moves that decrease $d_{t+1} - d_t$. Summing over all vertices, we can conclude that

$$\mathbf{E}[d_{t+1} - d_t \mid d_t = 1] = \mathbf{E}\Big[\sum_w \delta_w \;\Big|\; d_t = 1\Big] = \sum_w \mathbf{E}[\delta_w \mid d_t = 1] \le 0.$$

We now use an appropriate coupling to argue that $\mathbf{E}[d_{t+1} \mid d_t] \le d_t$ for any pair of
states $(X_t, Y_t)$. The statement is trivial if $d_t = 0$, and we have just shown it to be true
if $d_t = 1$. If $d_t > 1$, then create a chain of states $Z_0, Z_1, \ldots, Z_{d_t}$ as follows: $Z_0 = X_t$,
and each successive $Z_i$ is obtained from $Z_{i-1}$ by either removing a vertex from $X_t - Y_t$
or adding a vertex from $Y_t - X_t$. This can be done, for example, by first removing all
vertices in $X_t - Y_t$ one by one and then adding vertices from $Y_t - X_t$ one by one. Our
coupling now arises as follows. When a move is made in $X_t = Z_0$, the coupling for
the case when $d_t = 1$ gives a corresponding move for the state $Z_1$. This move in $Z_1$ can
similarly be coupled with a move in state $Z_2$, and so on, until the move in $Z_{d_t - 1}$ yields
a move for $Z_{d_t} = Y_t$. Let $Z_i'$ be the state after the move is made from state $Z_i$, and let

$$\Delta(Z_{i-1}', Z_i') = |Z_{i-1}' - Z_i'| + |Z_i' - Z_{i-1}'|.$$

Note that $Z_0' = X_{t+1}$ and $Z_{d_t}' = Y_{t+1}$. We have shown that $\mathbf{E}[d_{t+1} - d_t \mid d_t = 1] \le 0$,
so we can conclude that

$$\mathbf{E}[\Delta(Z_{i-1}', Z_i')] \le 1;$$

that is, because the two states $Z_{i-1}$ and $Z_i$ differ in just one vertex, the expected number
of vertices in which they differ after one step is at most 1. Using the triangle inequality
for sets,

$$|A - B| \le |A - C| + |C - B|,$$

we obtain

$$|X_{t+1} - Y_{t+1}| + |Y_{t+1} - X_{t+1}| \le \sum_{i=1}^{d_t} \big(|Z_{i-1}' - Z_i'| + |Z_i' - Z_{i-1}'|\big)$$

or

$$d_{t+1} = |X_{t+1} - Y_{t+1}| + |Y_{t+1} - X_{t+1}| \le \sum_{i=1}^{d_t} \Delta(Z_{i-1}', Z_i').$$

Hence,

$$\mathbf{E}[d_{t+1} \mid d_t] \le \mathbf{E}\Big[\sum_{i=1}^{d_t} \Delta(Z_{i-1}', Z_i')\Big] = \sum_{i=1}^{d_t} \mathbf{E}[\Delta(Z_{i-1}', Z_i')] \le d_t.$$

In previous examples we were able to prove a strict inequality of the form

$$\mathbf{E}[d_{t+1} \mid d_t] \le \beta d_t$$

for some $\beta < 1$, and we used this strict inequality to bound the mixing time. However,
the weaker condition $\mathbf{E}[d_{t+1} \mid d_t] \le d_t$ that we have here is sufficient for rapid
mixing, as we shall see in Exercise 11.7. Thus, the Markov chain gives an FPAUS for
independent sets in graphs when the maximum degree is at most 4; as we showed in
Section 10.3, this can be used to obtain an FPRAS for this problem.
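
The path $Z_0, Z_1, \ldots, Z_{d_t}$ used in the path coupling argument is easy to construct explicitly. The following sketch (the function name `coupling_path` and the example sets are hypothetical) follows the ordering suggested above: first remove the vertices of $X_t - Y_t$, then add the vertices of $Y_t - X_t$.

```python
def coupling_path(x_set, y_set):
    """Return [Z_0, ..., Z_d] with Z_0 = X and Z_d = Y; consecutive states differ in one vertex."""
    x_set, y_set = set(x_set), set(y_set)
    path = [frozenset(x_set)]
    for v in sorted(x_set - y_set):    # first remove the vertices of X - Y ...
        path.append(path[-1] - {v})
    for v in sorted(y_set - x_set):    # ... then add the vertices of Y - X
        path.append(path[-1] | {v})
    return path

# Two independent sets of a 4-cycle on vertices 0, 1, 2, 3 (a hypothetical example).
path = coupling_path({0, 2}, {1, 3})
```

With this ordering every intermediate state is a subset of $X_t$ or of $Y_t$, so each $Z_i$ is itself an independent set and the one-vertex coupling applies to every consecutive pair.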


11.7. Exercises 

Exercise 11.1: Write a program that takes as input two positive integers $n_1$ and $n_2$ and
two real numbers $p_1, p_2$ with $0 \le p_1, p_2 \le 1$. The output of your program should be
the variation distance between the binomial random variables $B(n_1, p_1)$ and $B(n_2, p_2)$,
rounded to the nearest thousandth. Use your program to compute the variation distance
between the following pairs of distributions: $B(20, 0.5)$ and $B(20, 0.49)$; $B(20, 0.5)$
and $B(21, 0.5)$; and $B(21, 0.5)$ and $B(21, 0.49)$.

Exercise 11.2: Consider the Markov chain for shuffling cards, where at each step a
card is chosen uniformly at random and moved to the top. Suppose that, instead of
running the chain for a fixed number of steps, we stop the chain at the first step where
every card has been moved to the top at least once. Show that, at this stopping time, the
state of the chain is uniformly distributed on the $n!$ possible permutations of the cards.

Exercise 11.3: Consider the Markov chain for shuffling cards, where at each step a
card is chosen uniformly at random and moved to the top. Show that, if the chain is
run for only $(1 - \varepsilon) n \ln n$ steps for some constant $\varepsilon > 0$, then the variation distance is
$1 - o(1)$.

Exercise 11.4: (a) Consider the Markov chain given by the transition matrix

$$P = \begin{bmatrix}
1/2 & 0 & 1/2 & 0 & 0 \\
0 & 1/2 & 1/2 & 0 & 0 \\
1/4 & 1/4 & 0 & 1/4 & 1/4 \\
0 & 0 & 1/2 & 1/2 & 0 \\
0 & 0 & 1/2 & 0 & 1/2
\end{bmatrix}.$$

Explain why Theorem 11.5 is not useful when applied directly to $P$. Then apply
Theorem 11.5 to the Markov chain with transition matrix $P^2$, and explain the implications
for the convergence of the original Markov chain to its stationary distribution.

(b) Consider the Markov chain given by the transition matrix

$$P = \begin{bmatrix}
1/2 & 0 & 1/2 & 0 & 0 \\
0 & 1/2 & 1/2 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 \\
0 & 0 & 1/2 & 1/2 & 0 \\
0 & 0 & 1/2 & 0 & 1/2
\end{bmatrix}.$$

Apply Theorem 11.5 to $P$. Then apply Theorem 11.5 to the Markov chain with transition
matrix $P^2$, and explain the implications for the convergence of the original Markov
chain to its stationary distribution. Which application gives better bounds on the
variation distance?

Exercise 11.5: Suppose I repeatedly roll a standard six-sided die and obtain a sequence
of independent random variables $X_1, X_2, \ldots$, where $X_i$ is the outcome of the $i$th roll.
Let

$$Y_j = \sum_{i=1}^{j} X_i \bmod 10$$

be the sum of the first $j$ rolls considered modulo 10. The sequence $Y_j$ forms a Markov
chain. Determine its stationary distribution, and determine a bound on $\tau(\varepsilon)$ for this
chain. (Hint: One approach is to use the method of Exercise 11.4.)

Exercise 11.6: Theorem 11.5 is useful only if there exists a nonzero entry in at least
one column of the transition matrix $P$ of the Markov chain. Argue that for any finite,
aperiodic, irreducible Markov chain, there exists a time $T$ such that every entry of $P^T$
is nonzero. Explain how this can be used in conjunction with Theorem 11.5.

Exercise 11.7: A technique we use repeatedly in the chapter is to define a distance
function $d_t$ that represents the distance between the two states of our coupling after $t$
steps, and then show that when $d_t > 0$ there exists a $\beta < 1$ such that

$$\mathbf{E}[d_{t+1} \mid d_t] \le \beta d_t.$$

(a) Under this condition, give an upper bound for $\tau(\varepsilon)$ in terms of $\beta$ and $d^*$, where $d^*$
is the maximum distance over all possible pairs of initial states for the coupling.

(b) Suppose that instead we have

$$\mathbf{E}[d_{t+1} \mid d_t] \le d_t.$$

Suppose we have the additional conditions that $d_{t+1}$ is one of $d_t$, $d_t - 1$, or $d_t + 1$
and that $\Pr(d_{t+1} \ne d_t) \ge \gamma$. Give an upper bound for $\tau(\varepsilon)$ in terms of $\varepsilon$, $d^*$, and
$\gamma$. Your answer should be polynomial in $d^*$ and $1/\gamma$. (Hint: Think of $d_t$ as being
similar to a random walk on the line.)

(c) Using (a) and (b), show that the mixing time of the coloring chain of Section 11.5
is polynomial in the number of vertices in the graph and $\ln(1/\varepsilon)$, even when the
number of colors is only $2\Delta$.

(d) By extending the argument of part (b), show that the mixing time of the Markov
chain for independent sets given in Section 11.6 is polynomial in the number of
vertices in the graph and $\ln(1/\varepsilon)$.

Exercise 11.8: Consider the random walk on a non-bipartite, connected graph on $n$
vertices, where each vertex has the same degree $d > n/2$. Show that

$$\tau(\varepsilon) \le \frac{\ln \varepsilon}{\ln(1 - (2d - n)/d)}.$$

Exercise 11.9: Consider a Markov chain on $n$ points $[0, n - 1]$ lying in order on a circle.
At each step, the chain stays at the current point with probability 1/2 or moves
to the next point in the clockwise direction with probability 1/2. Find the stationary
distribution and show that, for any $\varepsilon > 0$, the mixing time $\tau(\varepsilon)$ is $O(n^2 \ln(1/\varepsilon))$.

Exercise 11.10: In Section 11.2.3, we suggested the following coupling $Z_t = (X_t, Y_t)$.
First choose a transition for the chain $X_t$, with $v \in X_t$ and $w \in V$. If $v \in Y_t$, use the
same vertices $v$ and $w$ for the transition of the chain $Y_t$; otherwise, choose uniformly
at random a vertex $v' \in Y_t - X_t$ and then perform the transition in the chain $Y_t$ with the
pair $v'$ and $w$. Show that this is a valid coupling that satisfies Definition 11.3.

Exercise 11.11: Show that the Markov chain for sampling all independent sets of size
exactly $k \le n/(3(\Delta + 1))$ in a graph with $n$ nodes and maximum degree $\Delta$, as defined
in Section 11.2.3, is ergodic and has a uniform stationary distribution.

Exercise 11.12: We wish to improve the coupling technique used in Section 11.2.3 in
order to obtain a better bound. The improvement here is related to the technique used
to prove Theorem 11.8. As with the coupling in Section 11.2.3, if an attempt is made to
move $v \in X_t - Y_t$ to a vertex $w$, then the same attempt is made with the matched vertex
in the other chain. If, however, an attempt is made to move a vertex $v \in X_t \cap Y_t$ in
both chains, we no longer attempt to make the same move.


(a) Assume there exists a set $S_1$ of exactly $d_t(\Delta + 1)$ distinct vertices that are members
of or neighbors of vertices in $X_t - Y_t$ and, likewise, a set $S_2$ of exactly $d_t(\Delta + 1)$
distinct vertices that are members of or neighbors of vertices in $Y_t - X_t$; assume
further that $S_1$ and $S_2$ are disjoint. Suppose that we match up the vertices in $S_1$ and
$S_2$ in a one-to-one fashion. Argue that the moves can be coupled so that, when one
chain attempts and fails to move $v$ to a vertex in $S_1$ in one chain, it also attempts
and fails to move $v$ to the matching vertex in $S_2$ in the other chain. Similarly, argue
that the moves can be coupled so that, when one chain attempts and succeeds
in moving $v$ to a vertex in $S_1$ in one chain, it also attempts and succeeds in moving
$v$ to the matching vertex in $S_2$ in the other chain. Show that the coupling gives

$$\Pr(d_{t+1} = d_t + 1) \le \frac{k - d_t}{k} \cdot \frac{d_t(\Delta + 1)}{n}.$$

(b) In the general case, $S_1$ and $S_2$ are not necessarily disjoint or of equal size. Show
that in this case, by pairing up failing moves as much as possible, the number of
choices for $w$ that can increase $d_t$ is $\max(|S_1|, |S_2|) \le d_t(\Delta + 1)$. Then argue that

$$\Pr(d_{t+1} = d_t + 1) \le \frac{k - d_t}{k} \cdot \frac{d_t(\Delta + 1)}{n}$$

holds in all cases.

(c) Use this coupling to obtain a polynomial bound on $\tau(\varepsilon)$ that holds for any
$k \le n/(2(\Delta + 1))$.


Exercise 11.13: For a Markov chain with state space $S$ and for any nonnegative integer
$t$, let

$$\bar{\Delta}(t) = \max_{x, y \in S} \| p_x^t - p_y^t \|.$$

Assume also that the Markov chain has a stationary distribution.

(a) Prove $\bar{\Delta}(s + t) \le \bar{\Delta}(s)\,\bar{\Delta}(t)$ for any positive integers $s$ and $t$.

(b) Prove $\Delta(s + t) \le \Delta(s)\,\bar{\Delta}(t)$ for any positive integers $s$ and $t$.

(c) Prove

$$\Delta(t) \le \bar{\Delta}(t) \le 2\Delta(t)$$

for any positive integer $t$.

Exercise 11.14: Consider the following variation on shuffling for a deck of $n$ cards.
At each step, two specific cards are chosen uniformly at random from the deck, and
their positions are exchanged. (It is possible both choices give the same card, in which
case no change occurs.)

(a) Argue that the following is an equivalent process: at each step, a specific card is
chosen uniformly at random from the deck, and a position $i$ from $[1, n]$ is chosen
uniformly at random; then the card at position $i$ exchanges positions with the specific
card chosen.

(b) Consider the coupling where the two choices of card and position are the same
for both copies of the chain. Let $X_t$ be the number of cards whose positions are
different in the two copies of the chain. Show that $X_t$ is nonincreasing over time.

(c) Show that

$$\Pr(X_{t+1} \le X_t - 1 \mid X_t > 0) \ge \left(\frac{X_t}{n}\right)^2.$$

(d) Argue that the expected time until $X_t$ is 0 is $O(n^2)$, regardless of the starting state
of the two chains.

Exercise 11.15: Modify the arguments of Lemma 10.3 and Lemma 10.4 to show that,
if we have an FPAUS for proper colorings for any $c \ge \Delta + 2$, then we also have an
FPRAS for this value of $c$.



Exercise 11.16: Consider the following simple Markov chain whose states are independent
sets in a graph $G = (V, E)$. To compute $X_{i+1}$ from $X_i$:

• choose a vertex $v$ uniformly at random from $V$, and flip a fair coin;

• if the flip is heads and $v \in X_i$, then $X_{i+1} = X_i \setminus \{v\}$;

• if the flip is heads and $v \notin X_i$, then $X_{i+1} = X_i$;

• if the flip is tails, $v \notin X_i$, and adding $v$ to $X_i$ still gives an independent set, then
$X_{i+1} = X_i \cup \{v\}$;

• if the flip is tails and $v \in X_i$, then $X_{i+1} = X_i$.

(a) Show that the stationary distribution of this chain is uniform over all independent
sets.

(b) We consider this Markov chain specifically on cycles and line graphs. For a line
graph with $n$ vertices, the vertices are labeled 1 to $n$, and there is an edge from 1 to
2, 2 to 3, ..., $n - 1$ to $n$. A cycle graph on $n$ vertices is the same with the addition
of an edge from $n$ to 1.

Devise a coupling $(X_t, Y_t)$ for this Markov chain such that, on line graphs and
cycle graphs, if $d_t = |X_t - Y_t| + |Y_t - X_t|$ is the number of vertices on which the
two independent sets disagree, then at each step the coupling is at least as likely to
reduce $d_t$ as to increase $d_t$.

(c) With the coupling from part (b), argue that you can use this chain to obtain an
FPAUS for independent sets on a cycle graph or line graph. You may want to use
Exercise 11.7.

(d) For the special cases of line graphs and cycle graphs, we can derive exact formulas
for the number of independent sets. Derive exact formulas for these cases and
prove that your formulas are correct. (Hint: You may want to express your results
in terms of Fibonacci numbers.)

Exercise 11.17: For integers $a$ and $b$, an $a \times b$ grid is a graph whose vertices are all
ordered pairs of integers $(x, y)$ with $0 \le x < a$ and $0 \le y < b$. The edges of the graph
connect all pairs of distinct vertices $(x, y)$ and $(x', y')$ such that $|x - x'| + |y - y'| = 1$.
That is, every vertex is connected to the neighbors up, down, left, and right of it,
where vertices on the boundary are connected to the relevant points only. Consider the
following problems on the graph given by the $10 \times 10$ grid.

(a) Implement an FPAUS to generate an $\varepsilon$-uniform proper 10-coloring of the graph,
where $\varepsilon$ is given as an input. Discuss how many steps your Markov chain runs for,
what your starting state is, and any other relevant details.

(b) Using your FPAUS as a subroutine, implement an FPRAS to generate an $(\varepsilon, \delta)$-approximation
to the number of proper 10-colorings of the graph. Test your code
by running it to obtain a $(0.1, 0.001)$-approximation. (Note: This may take a significant
amount of time to run.) Discuss the ordering you choose on the edges,
how many samples are required at each step, how many steps of the Markov chain
you perform in total throughout the process, and any other relevant details.


Exercise 11.18: In Section 11.2.3 we considered the following Markov chain on independent
sets: a move is made from the independent set $X_t$ by choosing a vertex $v \in X_t$
uniformly at random and picking a vertex $w$ uniformly at random from the graph.
If $X_t - \{v\} + \{w\}$ is an independent set, then $X_{t+1} = X_t - \{v\} + \{w\}$; otherwise,
$X_{t+1} = X_t$. We have shown that the chain converges quickly to its stationary distribution
via bounding $\tau(\varepsilon)$ by an expression that is polynomial in $n$ and $\ln(1/\varepsilon)$ whenever
$k \le n/(2(\Delta + 1))$. Use the idea of path coupling to simplify the proof.


Exercise 11.19: In Section 11.5, we considered a simple Markov chain for coloring.
Suppose that we can apply the path coupling technique. (You do not need to show this.)
In this case, we can just consider the case where $d_t = 1$. Give a simpler argument that,
when $d_t = 1$ and $c > 2\Delta$, $\mathbf{E}[d_{t+1} \mid d_t] \le \beta d_t$ for some $\beta < 1$. Also show that, when
$d_t = 1$ and $c = 2\Delta$, $\mathbf{E}[d_{t+1} \mid d_t] \le d_t$.

CHAPTER TWELVE

Martingales 


Martingales are sequences of random variables satisfying certain conditions that arise
in numerous applications, such as random walks and gambling problems. We focus
here on three useful analysis tools related to martingales: the martingale stopping theorem,
Wald's equation, and the Azuma-Hoeffding inequality. The martingale stopping
theorem and Wald's equation are important tools for computing the expectation of compound
stochastic processes. The Azuma-Hoeffding inequality is a powerful technique
for deriving Chernoff-like tail bounds on the values of functions of dependent random
variables. We conclude this chapter with applications of the Azuma-Hoeffding
inequality to problems in pattern matching, balls and bins, and random graphs.


12.1. Martingales 


Definition 12.1: A sequence of random variables $Z_0, Z_1, \ldots$ is a martingale with respect
to the sequence $X_0, X_1, \ldots$ if, for all $n \ge 0$, the following conditions hold:

• $Z_n$ is a function of $X_0, X_1, \ldots, X_n$;

• $\mathbf{E}[|Z_n|] < \infty$;

• $\mathbf{E}[Z_{n+1} \mid X_0, \ldots, X_n] = Z_n$.

A sequence of random variables $Z_0, Z_1, \ldots$ is called a martingale when it is a martingale
with respect to itself. That is, $\mathbf{E}[|Z_n|] < \infty$, and $\mathbf{E}[Z_{n+1} \mid Z_0, \ldots, Z_n] = Z_n$.

A martingale can have a finite or a countably infinite number of elements. The indexing
of the martingale sequence does not need to start at 0. In fact, in many applications
it is more convenient to start it at 1. When we say that $Z_0, Z_1, \ldots$ is a martingale with
respect to $X_1, X_2, \ldots$, then we may consider $X_0$ to be a constant that is omitted.

For example, consider a gambler who plays a sequence of fair games. Let $X_i$ be the
amount the gambler wins on the $i$th game ($X_i$ is negative if the gambler loses), and let
$Z_i$ be the gambler's total winnings at the end of the $i$th game. Because each game is
fair, $\mathbf{E}[X_i] = 0$ and

$$\mathbf{E}[Z_{i+1} \mid X_1, X_2, \ldots, X_i] = Z_i + \mathbf{E}[X_{i+1}] = Z_i.$$

Thus, $Z_1, Z_2, \ldots, Z_n$ is a martingale with respect to the sequence $X_1, X_2, \ldots, X_n$.
Interestingly, the sequence is a martingale regardless of the amount bet on each game,
even if these amounts are dependent upon previous results.
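
The claim that the martingale property survives history-dependent bets can be checked by exhaustive enumeration. The sketch below is an illustration under our own assumptions: the games are fair coin flips, and the hypothetical strategy `bet` doubles the stake after every loss. It verifies $\mathbf{E}[Z_{i+1} \mid X_1, \ldots, X_i] = Z_i$ for every possible history of up to five games.

```python
from fractions import Fraction
from itertools import product

def bet(history):
    """Hypothetical strategy: stake 1 initially, double the stake after every loss."""
    stake = 1
    for outcome in history:
        stake = 2 * stake if outcome == -1 else 1
    return stake

def winnings(outcomes):
    """Total winnings Z_i after the fair coin flips in `outcomes` (+1 = win, -1 = loss)."""
    total = 0
    for i, outcome in enumerate(outcomes):
        total += outcome * bet(outcomes[:i])   # the stake depends only on earlier games
    return total

def conditional_expectation(history):
    """E[Z_{i+1} | X_1, ..., X_i] for a given history of i fair flips."""
    return sum(Fraction(1, 2) * winnings(history + (o,)) for o in (1, -1))

n = 5
# Martingale property, checked on every history of length 0, 1, ..., n - 1.
assert all(
    conditional_expectation(h) == winnings(h)
    for i in range(n)
    for h in product((1, -1), repeat=i)
)
# Expected winnings after exactly n games.
expected_final = sum(Fraction(winnings(h), 2 ** n) for h in product((1, -1), repeat=n))
```

Because the stake for game $i + 1$ is fixed once the first $i$ outcomes are known, the next game contributes zero in conditional expectation whatever the stake is.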

A Doob martingale refers to a martingale constructed using the following general
approach. Let $X_0, X_1, \ldots, X_n$ be a sequence of random variables, and let $Y$ be a random
variable with $\mathbf{E}[|Y|] < \infty$. (Generally, $Y$ will depend on $X_0, \ldots, X_n$.) Then

$$Z_i = \mathbf{E}[Y \mid X_0, \ldots, X_i], \qquad i = 0, 1, \ldots, n,$$

gives a martingale with respect to $X_0, X_1, \ldots, X_n$, since

$$\mathbf{E}[Z_{i+1} \mid X_0, \ldots, X_i] = \mathbf{E}\big[\mathbf{E}[Y \mid X_0, \ldots, X_{i+1}] \mid X_0, \ldots, X_i\big] = \mathbf{E}[Y \mid X_0, \ldots, X_i] = Z_i.$$

Here we have used the fact that $\mathbf{E}[Y \mid X_0, \ldots, X_{i+1}]$ is itself a random variable and that
Definition 2.7 for conditional expectation yields

$$\mathbf{E}[V \mid W] = \mathbf{E}\big[\mathbf{E}[V \mid U, W] \mid W\big].$$

In most applications we start the Doob martingale with $Z_0 = \mathbf{E}[Y]$, which corresponds
to $X_0$ being a trivial random variable that is independent of $Y$. To understand
the concept of the Doob martingale, assume that we want to predict the value of the
random variable $Y$ and that the value of $Y$ is a function of the values of the random
variables $X_1, \ldots, X_n$. The sequence $Z_0, Z_1, \ldots, Z_n$ represents a sequence of refined
estimates of the value of $Y$, gradually using more information on the values of the random
variables $X_1, X_2, \ldots, X_n$. The first element, $Z_0$, is just the expectation of $Y$. Element
$Z_i$ is the expected value of $Y$ when the values of $X_1, \ldots, X_i$ are known, and if $Y$ is fully
determined by $X_1, \ldots, X_n$ then $Z_n = Y$.

We now consider two examples of Doob martingales that arise in evaluating the
properties of random graphs. Let $G$ be a random graph from $G_{n,p}$. Label the $m = \binom{n}{2}$
possible edge slots in some arbitrary order, and let

$$X_j = \begin{cases} 1 & \text{if there is an edge in the $j$th edge slot,} \\ 0 & \text{otherwise.} \end{cases}$$

Consider any finite-valued function $F$ defined over graphs; for example, let $F(G)$
be the size of the largest independent set in $G$. Now let $Z_0 = \mathbf{E}[F(G)]$ and

$$Z_i = \mathbf{E}[F(G) \mid X_1, \ldots, X_i], \qquad i = 1, \ldots, m.$$

The sequence $Z_0, Z_1, \ldots, Z_m$ is a Doob martingale that represents the conditional
expectations of $F(G)$ as we reveal whether each edge is in the graph, one edge at a time.
This process of revealing edges gives a martingale that is commonly called the edge
exposure martingale.
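
For very small graphs the edge exposure martingale can be computed exactly by brute force. The sketch below is illustrative only; the instance ($n = 4$, $p = 1/3$) and the function names are our own choices. It takes $F(G)$ to be the size of the largest independent set and evaluates $Z_i$ by averaging $F$ over all ways of filling the unrevealed edge slots.

```python
from fractions import Fraction
from itertools import combinations, product

n, p = 4, Fraction(1, 3)                   # hypothetical small instance of G_{n,p}
slots = list(combinations(range(n), 2))    # the m = C(n, 2) edge slots, in a fixed order

def largest_independent_set(edge_bits):
    """F(G): size of the largest independent set, found by brute force."""
    edges = [e for e, bit in zip(slots, edge_bits) if bit]
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            chosen = set(subset)
            if not any(u in chosen and v in chosen for u, v in edges):
                return size
    return 0

def z(prefix):
    """Z_i = E[F(G) | X_1, ..., X_i], where `prefix` holds the first i edge indicators."""
    rest = len(slots) - len(prefix)
    total = Fraction(0)
    for tail in product((0, 1), repeat=rest):
        weight = Fraction(1)
        for bit in tail:
            weight *= p if bit else 1 - p
        total += weight * largest_independent_set(prefix + tail)
    return total

z0 = z(())                                 # Z_0 = E[F(G)], before any edge is revealed
# One-step martingale property for every partial exposure of the edge slots.
assert all(
    z(pre) == p * z(pre + (1,)) + (1 - p) * z(pre + (0,))
    for i in range(len(slots))
    for pre in product((0, 1), repeat=i)
)
```

Once all $m$ slots are revealed there is nothing left to average over, so $Z_m = F(G)$, matching the general remark that $Z_n = Y$ when $Y$ is fully determined by the revealed variables.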

Similarly, instead of revealing edges one at a time, we could reveal the set of edges
connected to a given vertex, one vertex at a time. Fix an arbitrary numbering of the
vertices 1 through $n$, and let $G_i$ be the subgraph of $G$ induced by the first $i$ vertices.
Then, setting $Z_0 = \mathbf{E}[F(G)]$ and

$$Z_i = \mathbf{E}[F(G) \mid G_1, \ldots, G_i], \qquad i = 1, \ldots, n,$$

gives a Doob martingale that is commonly called the vertex exposure martingale.

12.2. Stopping Times 

Returning to the gambler who participates in a sequence of fair gambling rounds, we
saw in the previous section that $Z_1, Z_2, \ldots$ is a martingale, where $Z_i$ is the gambler's
winnings after the $i$th game. If the player decides (before starting to play) to quit after
exactly $k$ games, what are the gambler's expected winnings?

Lemma 12.1: If the sequence $Z_0, Z_1, \ldots, Z_n$ is a martingale with respect to $X_0, X_1,
\ldots, X_n$, then

$$\mathbf{E}[Z_n] = \mathbf{E}[Z_0].$$

Proof: Since $Z_0, Z_1, \ldots$ is a martingale with respect to $X_0, X_1, \ldots, X_n$, it follows that

$$Z_i = \mathbf{E}[Z_{i+1} \mid X_0, \ldots, X_i].$$

Taking the expectation of both sides and using the definition of conditional expectation,
we have

$$\mathbf{E}[Z_i] = \mathbf{E}\big[\mathbf{E}[Z_{i+1} \mid X_0, \ldots, X_i]\big] = \mathbf{E}[Z_{i+1}].$$

Repeating this argument yields

$$\mathbf{E}[Z_n] = \mathbf{E}[Z_0].$$

■


Thus, if the number of games played is initially fixed then the expected gain from the
sequence of games is zero. Suppose now that the number of games played is not fixed.
For example, the gambler could choose to play a random number of games. An even
more complex (and realistic) situation arises when the gambler's decision to quit playing
is based on the outcome of the games already played. For example, the gambler
could decide to keep playing until his winnings total at least a hundred dollars. The
following notion proves quite powerful.

Definition 12.2: A nonnegative, integer-valued random variable $T$ is a stopping time
for the sequence $\{Z_n, n \ge 0\}$ if the event $T = n$ depends only on the value of the random
variables $Z_0, Z_1, \ldots, Z_n$.

A stopping time corresponds to a strategy for determining when to stop a sequence
based only on the outcomes seen so far. For example, the first time the gambler wins
five games in a row is a stopping time, since this can be determined by looking at the
outcomes of the games played. Similarly, the first time the gambler has won at least a
hundred dollars is also a stopping time. Letting $T$ be the last time the gambler wins
five games in a row, however, would not be a stopping time, since determining whether
$T = n$ cannot be done without knowing $Z_{n+1}, Z_{n+2}, \ldots$.

In order to fully utilize the martingale property, we need to characterize conditions
on the stopping time $T$ that maintain the property $\mathbf{E}[Z_T] = \mathbf{E}[Z_0] = 0$. It would seem,
if the game is fair, that $\mathbf{E}[Z_T] = 0$ should always hold. But consider the case where
the gambler's stopping time is the first $T$ such that $Z_T > B$, where $B$ is a fixed constant
greater than 0. In this case, the expected gain when the gambler quits playing is
greater than 0. The subtle problem with this stopping time is that it might not be finite,
so the gambler may never finish playing. The martingale stopping theorem shows that,
under certain conditions and in particular when the stopping time is bounded or has
bounded expectation, the expected value of the martingale at the stopping time is equal
to $\mathbf{E}[Z_0]$. We state a version of the martingale stopping theorem (sometimes called the
optional stopping theorem) without proof.

Theorem 12.2 [Martingale Stopping Theorem]: If $Z_0, Z_1, \ldots$ is a martingale with
respect to $X_1, X_2, \ldots$ and if $T$ is a stopping time for $X_1, X_2, \ldots$, then

$$\mathbf{E}[Z_T] = \mathbf{E}[Z_0]$$

whenever one of the following holds:

• the $Z_i$ are bounded, so there is a constant $c$ such that, for all $i$, $|Z_i| \le c$;

• $T$ is bounded;

• $\mathbf{E}[T] < \infty$, and there is a constant $c$ such that $\mathbf{E}[|Z_{i+1} - Z_i| \mid X_1, \ldots, X_i] < c$.

We use the martingale stopping theorem to derive a simple solution to the gambler's
ruin problem introduced in Section 7.2.1. Consider a sequence of independent, fair
gambling games. In each round, a player wins a dollar with probability 1/2 or loses a
dollar with probability 1/2. Let $Z_0 = 0$, let $X_i$ be the amount won on the $i$th game, and
let $Z_i$ be the total won by the player after $i$ games (again, $X_i$ and $Z_i$ are negative if the
player loses money). Assume that the player quits the game when she either loses $\ell_1$
dollars or wins $\ell_2$ dollars. What is the probability that the player wins $\ell_2$ dollars before
losing $\ell_1$ dollars?

Let the time $T$ be the first time the player has either won $\ell_2$ or lost $\ell_1$. Then $T$ is a
stopping time for $X_1, X_2, \ldots$. The sequence $Z_0, Z_1, \ldots$ is a martingale and, since the
values of the $Z_i$ are clearly bounded, we can apply the martingale stopping theorem.
We therefore have

$$\mathbf{E}[Z_T] = 0.$$
Let $q$ be the probability that the gambler quits playing after winning $\ell_2$ dollars. Then

$$\mathbf{E}[Z_T] = \ell_2 q - \ell_1 (1 - q) = 0,$$

giving

$$q = \frac{\ell_1}{\ell_1 + \ell_2},$$

matching the result found in Section 7.2.1.
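
As an independent check of this formula, the winning probability can also be computed without martingales, by conditioning on the outcome of the next game. The sketch below assumes this standard first-step recurrence (it is not taken from the text) and uses a hypothetical function name `win_probability`; exact rational arithmetic avoids any rounding.

```python
from fractions import Fraction

def win_probability(l1, l2):
    """Probability of reaching +l2 before -l1, starting from 0, in a fair +/-1 game."""
    # Let q(i) be the probability of winning from position i. Conditioning on the next
    # game gives q(i) = (q(i-1) + q(i+1)) / 2, that is, q(i+1) = 2 q(i) - q(i-1).
    # Start from q(-l1) = 0 with an unnormalized q(-l1 + 1) = 1, extend to position l2,
    # and rescale so that q(l2) = 1.
    values = [Fraction(0), Fraction(1)]
    for _ in range(l1 + l2 - 1):
        values.append(2 * values[-1] - values[-2])
    return values[l1] / values[-1]     # values[l1] corresponds to position 0

q = win_probability(3, 7)              # hypothetical limits: lose 3 or win 7 dollars
```

For any positive integers passed as `l1` and `l2`, the returned value can be compared directly with $\ell_1/(\ell_1 + \ell_2)$.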

12.2.1. Example: A Ballot Theorem

The following ballot theorem is another application of the martingale stopping theorem.
Suppose that two candidates run for an election. Candidate A obtains $a$ votes,
and candidate B obtains $b < a$ votes. The votes are counted in a random order, chosen
uniformly at random from all permutations on the $a + b$ votes. We show that the probability
that candidate A is always ahead in the count is $(a - b)/(a + b)$. Although this
can be determined combinatorially, we provide an elegant martingale argument.

Let $n = a + b$ be the total number of votes, and let $S_k$ be the number of votes by
which candidate A is leading after $k$ votes are counted ($S_k$ can be negative). Then $S_n =
a - b$. For $0 \le k \le n - 1$, define

$$X_k = \frac{S_{n-k}}{n - k}.$$

We first show that the sequence $X_0, X_1, \ldots, X_{n-1}$ forms a martingale. Note that the
sequence $X_0, X_1, \ldots, X_{n-1}$ relates to the counting process in a backward order; $X_0$ is a
function of $S_n$, $X_{n-1}$ is a function of $S_1$, and so on. Consider

$$\mathbf{E}[X_k \mid X_0, X_1, \ldots, X_{k-1}].$$

Conditioning on $X_0, \ldots, X_{k-1}$ is equivalent to conditioning on $S_n, S_{n-1}, \ldots, S_{n-k+1}$,
which in turn is equivalent to conditioning on the values of the count when counting
the last $k - 1$ votes.

Conditioning on $S_{n-k+1}$, the number of votes that candidate A had after counting the
first $n - k + 1$ votes is

$$\frac{n - k + 1 + S_{n-k+1}}{2},$$

and the number of votes that candidate B had is

$$\frac{n - k + 1 - S_{n-k+1}}{2}.$$

The $(n - k + 1)$th vote in the count is a random vote from among these first $n - k + 1$
votes. Also, $S_{n-k}$ is equal to $S_{n-k+1} + 1$ if the $(n - k + 1)$th vote was for candidate B
and equal to $S_{n-k+1} - 1$ if that vote was for candidate A. Thus, for $k \ge 1$,

$$\mathbf{E}[S_{n-k} \mid S_{n-k+1}] = (S_{n-k+1} + 1)\,\frac{n - k + 1 - S_{n-k+1}}{2(n - k + 1)} + (S_{n-k+1} - 1)\,\frac{n - k + 1 + S_{n-k+1}}{2(n - k + 1)} = S_{n-k+1}\,\frac{n - k}{n - k + 1}.$$

Therefore,

$$\mathbf{E}[X_k \mid X_0, \ldots, X_{k-1}] = \mathbf{E}\left[\frac{S_{n-k}}{n - k} \;\Big|\; S_n, \ldots, S_{n-k+1}\right] = \frac{S_{n-k+1}}{n - k + 1} = X_{k-1},$$

showing that the sequence $X_0, X_1, \ldots, X_{n-1}$ is a martingale.

Define $T$ to be the minimum $k$ such that $X_k = 0$ if such a $k$ exists, and $T = n - 1$
otherwise. Then $T$ is a bounded stopping time, satisfying the requirements of the martingale
stopping theorem, and

$$\mathbf{E}[X_T] = \mathbf{E}[X_0] = \frac{\mathbf{E}[S_n]}{n} = \frac{a - b}{a + b}.$$

We now consider two cases.

Case 1: Candidate A leads throughout the count. In this case, all $S_{n-k}$ (and therefore
all $X_k$) are positive for $0 \le k \le n - 1$, $T = n - 1$, and

$$X_T = X_{n-1} = S_1 = 1.$$

That $S_1 = 1$ follows because candidate A must receive the first vote in the count to
be ahead throughout the count.

Case 2: Candidate A does not lead throughout the count. In that case we claim that
for some $k < n - 1$, $X_k = 0$. Candidate A clearly has more votes at the end. If candidate
B ever leads, then there must be some intermediate point $k$ where $S_k$ (and
therefore $X_k$) is 0. In this case, $T = k < n - 1$ and $X_T = 0$.


Observe that

\[
\mathbf{E}[X_T] = \frac{a-b}{a+b} = 1 \cdot \Pr(\text{Case 1}) + 0 \cdot \Pr(\text{Case 2}),
\]

and thus the probability of Case 1, in which candidate A leads throughout the count, is $(a-b)/(a+b)$.
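
The ballot result can be checked exactly for small elections. The sketch below is an illustration rather than part of the original argument; the function name `prob_a_always_ahead` is our own. It enumerates every equally likely arrangement of the $a$ votes for A among the $a+b$ positions and counts those in which A strictly leads after every vote.

```python
from fractions import Fraction
from itertools import combinations


def prob_a_always_ahead(a, b):
    """Exact probability that candidate A strictly leads after every counted vote.

    Enumerates all C(a+b, a) equally likely placements of the A-votes.
    """
    n = a + b
    favorable = 0
    total = 0
    for a_positions in combinations(range(n), a):
        a_set = set(a_positions)
        lead = 0  # plays the role of S_k: A's lead after k votes
        ahead = True
        for i in range(n):
            lead += 1 if i in a_set else -1
            if lead <= 0:  # tied or behind: A does not lead throughout
                ahead = False
                break
        total += 1
        favorable += ahead
    return Fraction(favorable, total)


if __name__ == "__main__":
    for a, b in [(3, 2), (5, 3), (6, 1)]:
        # Second and third columns should agree: enumeration vs. (a-b)/(a+b).
        print(a, b, prob_a_always_ahead(a, b), Fraction(a - b, a + b))
```

Exact rational arithmetic is used so that the comparison with $(a-b)/(a+b)$ involves no rounding; the enumeration is only practical for small $a+b$.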

12.3. Wald’s Equation 

An important corollary of the martingale stopping theorem is known as Wald's equation. Wald's equation deals with the expectation of the sum of independent random variables in the case where the number of random variables being summed is itself a random variable.

Theorem 12.3 [Wald's Equation]: Let $X_1, X_2, \ldots$ be nonnegative, independent, identically distributed random variables with distribution $X$. Let $T$ be a stopping time for this sequence. If $T$ and $X$ have bounded expectation, then

\[
\mathbf{E}\left[\sum_{i=1}^{T} X_i\right] = \mathbf{E}[T] \cdot \mathbf{E}[X].
\]

In fact, Wald's equation holds more generally; there are different proofs of the equality that do not require the random variables $X_1, X_2, \ldots$ to be nonnegative.

Proof: For $i \ge 1$, let

\[
Z_i = \sum_{j=1}^{i} (X_j - \mathbf{E}[X]).
\]

The sequence $Z_1, Z_2, \ldots$ is a martingale with respect to $X_1, X_2, \ldots$, and $\mathbf{E}[Z_1] = 0$.

Now, $\mathbf{E}[T] < \infty$ and

\[
\mathbf{E}\big[|Z_{i+1} - Z_i| \;\big|\; X_1, \ldots, X_i\big] = \mathbf{E}\big[|X_{i+1} - \mathbf{E}[X]|\big] \le 2\,\mathbf{E}[X].
\]


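Hence we can apply the martingale stopping theorem to compute

\[
\mathbf{E}[Z_T] = \mathbf{E}[Z_1] = 0.
\]

We now find

\[
\mathbf{E}[Z_T] = \mathbf{E}\left[\sum_{j=1}^{T} (X_j - \mathbf{E}[X])\right] = \mathbf{E}\left[\left(\sum_{j=1}^{T} X_j\right) - T\,\mathbf{E}[X]\right] = \mathbf{E}\left[\sum_{j=1}^{T} X_j\right] - \mathbf{E}[T] \cdot \mathbf{E}[X] = 0,
\]

which gives the result. ■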


In the case of a sequence of independent random variables, we have an equivalent, simpler definition of stopping time that is easier to apply.

Definition 12.3: Let $Z_0, Z_1, \ldots, Z_n$ be a sequence of independent random variables. A nonnegative, integer-valued random variable $T$ is a stopping time for the sequence if the event $T = n$ is independent of $Z_{n+1}, Z_{n+2}, \ldots$.


As a simple example, consider a gambling game in which a player first rolls one standard die. If the outcome of the roll is $X$ then she rolls $X$ new standard dice and her gain $Z$ is the sum of the outcomes of the $X$ dice. What is the expected gain of this game?

For $1 \le i \le X$, let $Y_i$ be the outcome of the $i$th die in the second round. Then

\[
\mathbf{E}[Z] = \mathbf{E}\left[\sum_{i=1}^{X} Y_i\right].
\]

By Definition 12.3, $X$ is a stopping time, and hence by Wald's equation we obtain

\[
\mathbf{E}[Z] = \mathbf{E}[X] \cdot \mathbf{E}[Y_i] = \left(\frac{7}{2}\right)^2 = \frac{49}{4}.
\]
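
The dice game is small enough to verify by brute force. The following sketch (our own illustration; `expected_gain` is a hypothetical helper name) averages the gain over every outcome of the first roll and every outcome of the subsequent rolls, without appealing to Wald's equation.

```python
from fractions import Fraction
from itertools import product


def expected_gain(sides=6):
    """Exact expected gain: roll one die to get X, then sum X further dice."""
    faces = range(1, sides + 1)
    total = Fraction(0)
    for x in faces:  # outcome of the first roll, probability 1/sides
        # Sum of the gains over all sides**x equally likely second-round outcomes.
        gain_sum = sum(sum(rolls) for rolls in product(faces, repeat=x))
        total += Fraction(1, sides) * Fraction(gain_sum, sides ** x)
    return total


if __name__ == "__main__":
    print(expected_gain())  # compare with (7/2)**2
```

The enumeration agrees with the product $\mathbf{E}[X] \cdot \mathbf{E}[Y_i]$ given by Wald's equation, here for a standard six-sided die.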

Wald's equation can arise in the analysis of Las Vegas algorithms, which always give the right answer but have variable running times, as we saw for the randomized algorithm for the median described in Section 3.4. In a Las Vegas algorithm we often repeatedly perform some randomized subroutine that may or may not return the right answer. We then use some deterministic checking subroutine to determine whether or not the answer is correct; if it is correct then the Las Vegas algorithm terminates with the correct answer, and otherwise the randomized subroutine is run again. If $N$ is the number of trials until a correct answer is found and if $X_i$ is the running time for the two subroutines on the $i$th trial, then, as long as the $X_i$ are independent and identically distributed with distribution $X$, Wald's equation gives that the expected running time for the algorithm is

\[
\mathbf{E}\left[\sum_{i=1}^{N} X_i\right] = \mathbf{E}[N] \cdot \mathbf{E}[X].
\]

An example of this approach is given in Exercise 12.12.

As another example, consider a set of $n$ servers communicating through a shared channel. Time is divided into discrete slots. At each time slot, any server that needs to send a packet can transmit it through the channel. If exactly one packet is sent at that time, the transmission is successfully completed. If more than one packet is sent, then none are successful (and the senders detect the failure). Packets are stored in the server's buffer until they are successfully transmitted. Servers use the following simple protocol: at each time slot, if the server's buffer is not empty then with probability $1/n$ it attempts to send the first packet in its buffer. Assume that servers have an infinite sequence of packets in their buffers. What is the expected number of time slots used until each server successfully sends at least one packet?

Let $N$ be the number of packets successfully sent until each server has successfully sent at least one packet. Let $t_i$ be the time slot in which the $i$th successfully transmitted packet is sent, starting from time $t_0 = 0$, and let $r_i = t_i - t_{i-1}$. Then $T$, the number of time slots until each server successfully sends at least one packet, is given by

\[
T = \sum_{i=1}^{N} r_i.
\]

You may check that $N$ is independent of the $r_i$, and $N$ is bounded in expectation; hence $N$ is a stopping time for the sequence of $r_i$.

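The probability that a packet is successfully sent in a given time slot is the probability that exactly one of the $n$ servers transmits:

\[
p = \binom{n}{1} \frac{1}{n} \left(1 - \frac{1}{n}\right)^{n-1} \approx e^{-1}.
\]

The $r_i$ each have a geometric distribution with parameter $p$, so $\mathbf{E}[r_i] = 1/p \approx e$.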
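
To get a feel for these quantities, the sketch below evaluates the per-slot success probability and the product $nH(n)/p$ that the text goes on to derive with Wald's equation. It is only a numerical illustration; the function names are our own, and the formula for $p$ assumes exactly the protocol described above (each of the $n$ servers transmits independently with probability $1/n$).

```python
from math import e


def success_probability(n):
    """Pr(exactly one of n servers transmits) = n * (1/n) * (1 - 1/n)**(n - 1)."""
    return (1 - 1 / n) ** (n - 1)


def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n, the coupon collector factor."""
    return sum(1 / i for i in range(1, n + 1))


def expected_slots(n):
    """n * H(n) / p: expected successes needed times expected slots per success."""
    return n * harmonic(n) / success_probability(n)


if __name__ == "__main__":
    for n in (2, 10, 100, 1000):
        p = success_probability(n)
        # 1/p approaches e from below as n grows.
        print(n, round(1 / p, 5), round(e, 5), round(expected_slots(n), 2))
```

For large $n$ the value $1/p$ is close to $e$, which is why the expected number of slots behaves like $e\,n \ln n$ to first order.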

Given that a packet was successfully sent at a given time slot, the sender of that packet is uniformly distributed among the $n$ servers, independent of previous steps. Using our analysis of the expectation of the coupon collector's problem from Chapter 2, we deduce that $\mathbf{E}[N] = nH(n) = n \ln n + O(n)$. We now use Wald's equation to compute

\[
\mathbf{E}[T] = \mathbf{E}\left[\sum_{i=1}^{N} r_i\right] = \mathbf{E}[N] \cdot \mathbf{E}[r_i] = \frac{nH(n)}{p},
\]

which is about $e\,n \ln n$.

12.4. Tail Inequalities for Martingales


Perhaps the most useful property of martingales for the analysis of algorithms is that Chernoff-like tail inequalities can apply, even when the underlying random variables are not independent. The main results in this area are Azuma's inequality and Hoeffding's inequality. They are quite similar, so they are often together referred to as the Azuma-Hoeffding inequality.


Theorem 12.4 [Azuma-Hoeffding Inequality]: Let $X_0, \ldots, X_n$ be a martingale such that

\[
|X_k - X_{k-1}| \le c_k.
\]

Then, for all $t \ge 0$ and any $\lambda > 0$,

\[
\Pr(|X_t - X_0| \ge \lambda) \le 2e^{-\lambda^2 / (2\sum_{k=1}^{t} c_k^2)}.
\]


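Proof: The proof follows the same format as that for Chernoff bounds (Section 4.2). We first derive an upper bound for $\mathbf{E}[e^{\alpha(X_t - X_0)}]$. Toward that end, we define

\[
Y_i = X_i - X_{i-1}, \quad i = 1, \ldots, t.
\]

Note that $|Y_i| \le c_i$ and, since $X_0, X_1, \ldots$ is a martingale,

\[
\mathbf{E}[Y_i \mid X_0, X_1, \ldots, X_{i-1}] = \mathbf{E}[X_i - X_{i-1} \mid X_0, X_1, \ldots, X_{i-1}] = \mathbf{E}[X_i \mid X_0, X_1, \ldots, X_{i-1}] - X_{i-1} = 0.
\]

Now consider

\[
\mathbf{E}[e^{\alpha Y_i} \mid X_0, X_1, \ldots, X_{i-1}].
\]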


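Writing

\[
Y_i = -c_i\,\frac{1 - Y_i/c_i}{2} + c_i\,\frac{1 + Y_i/c_i}{2}
\]

and using the convexity of $e^{\alpha Y_i}$, we have that

\[
e^{\alpha Y_i} \le \frac{1 - Y_i/c_i}{2}\,e^{-\alpha c_i} + \frac{1 + Y_i/c_i}{2}\,e^{\alpha c_i} = \frac{e^{\alpha c_i} + e^{-\alpha c_i}}{2} + \frac{Y_i}{2c_i}\left(e^{\alpha c_i} - e^{-\alpha c_i}\right).
\]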

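Since $\mathbf{E}[Y_i \mid X_0, X_1, \ldots, X_{i-1}] = 0$, we have

\[
\mathbf{E}[e^{\alpha Y_i} \mid X_0, X_1, \ldots, X_{i-1}] \le \mathbf{E}\left[\frac{e^{\alpha c_i} + e^{-\alpha c_i}}{2} + \frac{Y_i}{2c_i}\left(e^{\alpha c_i} - e^{-\alpha c_i}\right) \;\Big|\; X_0, X_1, \ldots, X_{i-1}\right] = \frac{e^{\alpha c_i} + e^{-\alpha c_i}}{2} \le e^{(\alpha c_i)^2/2}.
\]

Here we have used the Taylor series expansion of $e^x$ to find

\[
\frac{e^{\alpha c_i} + e^{-\alpha c_i}}{2} \le e^{(\alpha c_i)^2/2}
\]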


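in a manner similar to the proof of Theorem 4.7. It follows that

\[
\begin{aligned}
\mathbf{E}\left[e^{\alpha(X_t - X_0)}\right] &= \mathbf{E}\left[\prod_{i=1}^{t} e^{\alpha Y_i}\right] \\
&= \mathbf{E}\left[\prod_{i=1}^{t-1} e^{\alpha Y_i} \cdot \mathbf{E}\left[e^{\alpha Y_t} \mid X_0, X_1, \ldots, X_{t-1}\right]\right] \\
&\le \mathbf{E}\left[\prod_{i=1}^{t-1} e^{\alpha Y_i}\right] e^{(\alpha c_t)^2/2} \\
&\le e^{\alpha^2 \sum_{i=1}^{t} c_i^2 / 2}.
\end{aligned}
\]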


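Hence,

\[
\Pr(X_t - X_0 \ge \lambda) = \Pr\left(e^{\alpha(X_t - X_0)} \ge e^{\alpha\lambda}\right) \le \frac{\mathbf{E}\left[e^{\alpha(X_t - X_0)}\right]}{e^{\alpha\lambda}} \le e^{\alpha^2 \sum_{k=1}^{t} c_k^2/2 - \alpha\lambda} \le e^{-\lambda^2/(2\sum_{k=1}^{t} c_k^2)},
\]

where the last inequality comes from choosing $\alpha = \lambda / \sum_{k=1}^{t} c_k^2$. A similar argument gives the bound for $\Pr(X_t - X_0 \le -\lambda)$, as can be seen for example by replacing $X_i$ everywhere by $-X_i$, giving the theorem. ■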

The following corollary is often easier to apply. 

Corollary 12.5: Let $X_0, X_1, \ldots$ be a martingale such that, for all $k \ge 1$,

\[
|X_k - X_{k-1}| \le c.
\]

Then, for all $t \ge 1$ and $\lambda > 0$,

\[
\Pr\left(|X_t - X_0| \ge \lambda c \sqrt{t}\right) \le 2e^{-\lambda^2/2}.
\]
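
As a sanity check on Theorem 12.4, one can compare the bound with an exact tail probability. The sketch below uses the simple $\pm 1$ random walk started at 0, which is a martingale with $c_k = 1$; the choice of this example and the function names are ours, not the text's.

```python
from math import comb, exp


def azuma_bound(lam, c):
    """Theorem 12.4 bound 2*exp(-lam^2 / (2 * sum c_k^2)) for increment bounds c."""
    return 2 * exp(-lam ** 2 / (2 * sum(ck ** 2 for ck in c)))


def walk_tail(t, lam):
    """Exact Pr(|X_t| >= lam) for a fair +/-1 walk with X_0 = 0.

    X_t = 2*H - t where H, the number of +1 steps, is Binomial(t, 1/2).
    """
    hits = sum(comb(t, h) for h in range(t + 1) if abs(2 * h - t) >= lam)
    return hits / 2 ** t


if __name__ == "__main__":
    t = 100
    for lam in (5, 10, 20, 30):
        print(lam, walk_tail(t, lam), azuma_bound(lam, [1] * t))
```

The exact tail is always below the bound, and for small $\lambda$ the bound exceeds 1 and so carries no information, which is typical of exponential tail bounds.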


We now present a more general form of the Azuma-Hoeffding inequality that yields 
slightly tighter bounds in our applications. 


Theorem 12.6 [Azuma-Hoeffding Inequality]: Let $X_0, \ldots, X_n$ be a martingale such that

\[
B_k \le X_k - X_{k-1} \le B_k + d_k
\]

for some constants $d_k$ and for some random variables $B_k$ that may be functions of $X_0, X_1, \ldots, X_{k-1}$. Then, for all $t \ge 0$ and any $\lambda > 0$,

\[
\Pr(|X_t - X_0| \ge \lambda) \le 2e^{-2\lambda^2 / (\sum_{k=1}^{t} d_k^2)}.
\]

This version of the inequality generalizes the requirement of a bound on $|X_k - X_{k-1}|$. The key is the gap $d_k$ between the lower and upper bounds for $X_k - X_{k-1}$. Notice that, when we have the bound $|X_k - X_{k-1}| \le c_k$, this result is equivalent to Theorem 12.4 using $B_k = -c_k$ with a gap $d_k = 2c_k$. The proof is similar to that for Theorem 12.4 and is left as Exercise 12.6.

12.5. Applications of the Azuma-Hoeffding Inequality 
12.5.1. General Formalization 

Before giving several applications of the Azuma-Hoeffding inequality, we describe a useful general technique. Let us say that a function

\[
f(X) = f(X_1, X_2, \ldots, X_n)
\]

satisfies the Lipschitz condition with bound $c$ if, for any $i$ and for any set of values $x_1, \ldots, x_n$ and $y_i$,

\[
|f(x_1, x_2, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) - f(x_1, x_2, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_n)| \le c.
\]

That is, changing the value of any single coordinate can change the function value by at most $c$.

Let

\[
Z_0 = \mathbf{E}[f(X_1, X_2, \ldots, X_n)]
\]

and

\[
Z_k = \mathbf{E}[f(X_1, X_2, \ldots, X_n) \mid X_1, X_2, \ldots, X_k].
\]

The sequence $Z_0, Z_1, \ldots$ is a Doob martingale, and if the $X_k$ are independent random variables then we claim that there exist random variables $B_k$, depending only on $Z_0, \ldots, Z_{k-1}$, with $B_k \le Z_k - Z_{k-1} \le B_k + c$. The gap between the lower and upper bounds on $Z_k - Z_{k-1}$ is then at most $c$, so the Azuma-Hoeffding inequality of Theorem 12.6 applies.

We prove this for the case of discrete random variables (although the result holds more generally). To ease the notation, we use $S_k$ as shorthand for $X_1, X_2, \ldots, X_k$, so that we write

\[
\mathbf{E}[f(X) \mid S_k]
\]

for

\[
\mathbf{E}[f(X) \mid X_1, X_2, \ldots, X_k].
\]

Also, let us abuse notation and define

\[
f_k(X, x) = f(X_1, \ldots, X_{k-1}, x, X_{k+1}, \ldots, X_n).
\]

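That is, $f_k(X, x)$ is $f(X)$ with the value $x$ in the $k$th coordinate. We shall likewise write $f_k(z, x)$ for $f(z_1, \ldots, z_{k-1}, x, z_{k+1}, \ldots, z_n)$. With this notation,

\[
Z_k - Z_{k-1} = \mathbf{E}[f(X) \mid S_k] - \mathbf{E}[f(X) \mid S_{k-1}].
\]

Hence $Z_k - Z_{k-1}$ is bounded above by

\[
\sup_{x} \mathbf{E}[f(X) \mid S_{k-1}, X_k = x] - \mathbf{E}[f(X) \mid S_{k-1}]
\]

and bounded below by

\[
\inf_{y} \mathbf{E}[f(X) \mid S_{k-1}, X_k = y] - \mathbf{E}[f(X) \mid S_{k-1}].
\]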

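(If we are dealing with random variables that can take on only a finite number of values, we could use max and min in place of sup and inf.) Therefore, letting

\[
B_k = \inf_{y} \mathbf{E}[f(X) \mid S_{k-1}, X_k = y] - \mathbf{E}[f(X) \mid S_{k-1}],
\]

if we can bound

\[
\sup_{x} \mathbf{E}[f(X) \mid S_{k-1}, X_k = x] - \inf_{y} \mathbf{E}[f(X) \mid S_{k-1}, X_k = y] \le c
\]

then we will have appropriately bounded the gap $Z_k - Z_{k-1}$. Now

\[
\begin{aligned}
\sup_{x} \mathbf{E}[f(X) \mid S_{k-1}, X_k = x] &- \inf_{y} \mathbf{E}[f(X) \mid S_{k-1}, X_k = y] \\
&= \sup_{x,y} \left(\mathbf{E}[f(X) \mid S_{k-1}, X_k = x] - \mathbf{E}[f(X) \mid S_{k-1}, X_k = y]\right) \\
&= \sup_{x,y} \mathbf{E}[f_k(X, x) - f_k(X, y) \mid S_{k-1}].
\end{aligned}
\]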

Because the $X_i$ are independent, the probability of any specific set of values for $X_{k+1}$ through $X_n$ does not depend on the values of $X_1, \ldots, X_k$. Hence, for any values $x, y, z_1, \ldots, z_{k-1}$ we have that

\[
\mathbf{E}[f_k(X, x) - f_k(X, y) \mid X_1 = z_1, \ldots, X_{k-1} = z_{k-1}]
\]

is equal to

\[
\sum_{z_{k+1}, \ldots, z_n} \Pr\big((X_{k+1} = z_{k+1}) \cap \cdots \cap (X_n = z_n)\big) \cdot \big(f_k(z, x) - f_k(z, y)\big).
\]

But

\[
f_k(z, x) - f_k(z, y) \le c,
\]

and hence

\[
\mathbf{E}[f_k(X, x) - f_k(X, y) \mid S_{k-1}] \le c,
\]

giving the required bound.

The requirement that the $X_i$ be independent random variables is essential to applying this general framework. Finding a counterexample when the $X_i$ are not independent is left as Exercise 12.20.

12.5.2. Application: Pattern Matching

In many scenarios, including examining DNA structure, a goal is to find interesting 
patterns in a sequence of characters. In this context, the phrase “interesting patterns” 
often refers to strings that occur more often than one would expect if the characters were 
simply generated randomly. This notion of “interesting” is reasonable if the number 
of occurrences of a string is concentrated around its expectation in the random model. 
We show concentration using the Azuma-Hoeffding inequality for a simple random model.

Let $X = (X_1, \ldots, X_n)$ be a sequence of characters chosen independently and uniformly at random from an alphabet $\Sigma$, where $s = |\Sigma|$. Let $B = (b_1, \ldots, b_k)$ be a fixed string of $k$ characters from $\Sigma$. Let $F$ be the number of occurrences of the fixed string $B$ in the random string $X$. Clearly,

\[
\mathbf{E}[F] = (n - k + 1)\left(\frac{1}{s}\right)^{k}.
\]

We use a Doob martingale and the Azuma-Hoeffding inequality to show that, if $k$ is relatively small with respect to $n$, then the number of occurrences of $B$ in $X$ is highly concentrated around its mean.

Let

\[
Z_0 = \mathbf{E}[F],
\]

and for $1 \le i \le n$ let

\[
Z_i = \mathbf{E}[F \mid X_1, \ldots, X_i].
\]

The sequence $Z_0, \ldots, Z_n$ is a Doob martingale, and

\[
Z_n = F.
\]

Since each character in the string $X$ can participate in no more than $k$ possible matches, for any $0 \le i < n$ we have

\[
|Z_{i+1} - Z_i| \le k.
\]

In other words, the value of $X_{i+1}$ can affect the value of $F$ by at most $k$ in either direction, since $X_{i+1}$ participates in no more than $k$ possible matches. Hence the difference

\[
|\mathbf{E}[F \mid X_1, \ldots, X_{i+1}] - \mathbf{E}[F \mid X_1, \ldots, X_i]| = |Z_{i+1} - Z_i|
\]

must be at most $k$. Applying Theorem 12.4 yields

\[
\Pr(|F - \mathbf{E}[F]| \ge \varepsilon) \le 2e^{-\varepsilon^2/(2nk^2)},
\]

or (from Corollary 12.5)

\[
\Pr\left(|F - \mathbf{E}[F]| \ge \lambda k \sqrt{n}\right) \le 2e^{-\lambda^2/2}.
\]

We can obtain slightly better bounds by applying the general framework of Theorem 12.6. Let $F = f(X_1, X_2, \ldots, X_n)$. Then, by our preceding argument, changing the value of any single $X_i$ can change the value of $F$ by at most $k$, and hence the function satisfies the Lipschitz condition with bound $k$. Theorem 12.6 then applies to give

\[
\Pr(|F - \mathbf{E}[F]| \ge \varepsilon) \le 2e^{-2\varepsilon^2/(nk^2)},
\]

improving the value in the exponent by a factor of 4.
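
The expectation formula and the two tail bounds are easy to evaluate for small instances. The sketch below is illustrative only: the helper names are ours, and the exhaustive enumeration over all $s^n$ strings is feasible only for tiny $n$.

```python
from fractions import Fraction
from itertools import product
from math import exp


def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def exact_expected_occurrences(n, alphabet, pattern):
    """E[F] by enumerating every length-n string over the alphabet."""
    pattern = tuple(pattern)
    total = sum(count_occurrences(x, pattern) for x in product(alphabet, repeat=n))
    return Fraction(total, len(alphabet) ** n)


def bound_theorem_12_4(eps, n, k):
    """2*exp(-eps^2 / (2 n k^2)), from |Z_{i+1} - Z_i| <= k."""
    return 2 * exp(-eps ** 2 / (2 * n * k ** 2))


def bound_theorem_12_6(eps, n, k):
    """2*exp(-2 eps^2 / (n k^2)), from the Lipschitz condition with bound k."""
    return 2 * exp(-2 * eps ** 2 / (n * k ** 2))


if __name__ == "__main__":
    # Compare enumeration with (n - k + 1) * (1/s)**k for n = 6, s = 2, k = 3.
    print(exact_expected_occurrences(6, "ab", "aba"), Fraction(6 - 3 + 1, 2 ** 3))
    print(bound_theorem_12_4(300, 10_000, 3), bound_theorem_12_6(300, 10_000, 3))
```

The second bound is never larger than the first, reflecting the factor of 4 gained in the exponent.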

12.5.3. Application: Balls and Bins 

Suppose that we are throwing $m$ balls independently and uniformly at random into $n$ bins. Let $X_i$ be the random variable representing the bin into which the $i$th ball falls. Let $F$ be the number of empty bins after the $m$ balls are thrown. Then the sequence

\[
Z_i = \mathbf{E}[F \mid X_1, \ldots, X_i]
\]

is a Doob martingale. We claim that $F = f(X_1, X_2, \ldots, X_m)$ satisfies the Lipschitz condition with bound 1. Consider how $F$ changes from the placement of the $i$th ball. If the $i$th ball lands in a bin on its own, then changing $X_i$ so that the $i$th ball lands in a bin with some other ball will increase $F$ by 1. Similarly, if the $i$th ball lands in a bin with other balls, then changing $X_i$ so that the $i$th ball lands in an otherwise empty bin decreases $F$ by 1. In all other cases, changing $X_i$ leaves $F$ the same. We therefore obtain

\[
\Pr(|F - \mathbf{E}[F]| \ge \varepsilon) \le 2e^{-2\varepsilon^2/m}
\]

by the Azuma-Hoeffding inequality of Theorem 12.6. We could also apply Theorem 12.4 with $|Z_{i+1} - Z_i| \le 1$, but this gives a slightly weaker result. Here

\[
\mathbf{E}[F] = n\left(1 - \frac{1}{n}\right)^{m},
\]

but we could obtain the concentration result without knowing $\mathbf{E}[F]$.

This result can be improved by taking more care in bounding the gap between the bounds on $Z_{i+1} - Z_i$. This is considered in Exercise 12.19.
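
Both the Lipschitz claim and the formula for $\mathbf{E}[F]$ can be confirmed by exhaustive enumeration when $m$ and $n$ are tiny. The following sketch is our own illustration (the function names are hypothetical); it represents an outcome as a tuple giving the bin of each ball.

```python
from fractions import Fraction
from itertools import product


def empty_bins(assignment, n):
    """F: number of the n bins that receive no ball."""
    return n - len(set(assignment))


def exact_expected_empty(m, n):
    """E[F] by enumerating all n**m equally likely assignments."""
    total = sum(empty_bins(a, n) for a in product(range(n), repeat=m))
    return Fraction(total, n ** m)


def max_single_change(m, n):
    """Largest change in F caused by moving any one ball to any other bin."""
    worst = 0
    for a in product(range(n), repeat=m):
        base = empty_bins(a, n)
        for i in range(m):
            for new_bin in range(n):
                moved = a[:i] + (new_bin,) + a[i + 1:]
                worst = max(worst, abs(empty_bins(moved, n) - base))
    return worst


if __name__ == "__main__":
    m, n = 4, 3
    print(exact_expected_empty(m, n), n * Fraction(n - 1, n) ** m)
    print(max_single_change(m, n))  # Lipschitz bound observed by enumeration
```

The enumeration agrees with $n(1 - 1/n)^m$ and never sees a single-ball move change $F$ by more than 1.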

12.5.4. Application: Chromatic Number 

Given a random graph $G$ in $G_{n,p}$, the chromatic number $\chi(G)$ is the minimum number of colors needed in order to color all vertices of the graph so that no adjacent vertices have the same color. We use the vertex exposure martingale defined in Section 12.1 to obtain a concentration result for $\chi(G)$.

Let $G_i$ be the random subgraph of $G$ induced by the set of vertices $1, \ldots, i$, let $Z_0 = \mathbf{E}[\chi(G)]$, and let

\[
Z_i = \mathbf{E}[\chi(G) \mid G_1, \ldots, G_i].
\]

Since a vertex uses no more than one new color, again we have that the gap between $Z_i$ and $Z_{i-1}$ is at most 1, so we can apply the general framework of the Azuma-Hoeffding inequality from Theorem 12.6. We conclude that

\[
\Pr\left(|\chi(G) - \mathbf{E}[\chi(G)]| \ge \lambda\sqrt{n}\right) \le 2e^{-2\lambda^2}.
\]

This result holds even without knowing $\mathbf{E}[\chi(G)]$.

12.6. Exercises

Exercise 12.1: Show that, if $Z_0, Z_1, \ldots, Z_n$ is a martingale with respect to $X_0, X_1, \ldots, X_n$, then it is also a martingale with respect to itself.

Exercise 12.2: Let $X_0 = 0$ and for $j \ge 0$ let $X_{j+1}$ be chosen uniformly over the real interval $[X_j, 1]$. Show that, for $k \ge 0$, the sequence

\[
Y_k = 2^k (1 - X_k)
\]

is a martingale.

Exercise 12.3: Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with expectation 0 and variance $\sigma^2 < \infty$. Let

\[
Z_n = \left(\sum_{i=1}^{n} X_i\right)^2 - n\sigma^2.
\]

Show that $Z_1, Z_2, \ldots$ is a martingale.


Exercise 12.4: Consider the gambler's ruin problem, where a player plays a sequence of independent games, either winning one dollar with probability 1/2 or losing one dollar with probability 1/2. The player continues until either losing $\ell_1$ dollars or winning $\ell_2$ dollars. Let $X_n$ be 1 if the player wins the $n$th game and $-1$ otherwise. Let

\[
Z_n = \left(\sum_{i=1}^{n} X_i\right)^2 - n.
\]

(a) Show that $Z_1, Z_2, \ldots$ is a martingale.

(b) Let $T$ be the stopping time when the player finishes playing. Determine $\mathbf{E}[Z_T]$.

(c) Calculate $\mathbf{E}[T]$. (Hint: You can use what you already know about the probability that the player wins.)

Exercise 12.5: Consider the gambler's ruin problem, where now the independent games are such that the player either wins one dollar with probability $p < 1/2$ or loses one dollar with probability $1 - p$. As in Exercise 12.4, the player continues until either losing $\ell_1$ dollars or winning $\ell_2$ dollars. Let $X_n$ be 1 if the player wins the $n$th game and $-1$ otherwise, and let $Z_n$ be the player's total winnings after $n$ games.

(a) Show that

\[
A_n = \left(\frac{1-p}{p}\right)^{Z_n}
\]

is a martingale with mean 1.

(b) Determine the probability that the player wins $\ell_2$ dollars before losing $\ell_1$ dollars.

(c) Show that

\[
B_n = Z_n - (2p - 1)n
\]

is a martingale with mean 0.

(d) Let $T$ be the stopping time when the player finishes playing. Determine $\mathbf{E}[Z_T]$, and use it to determine $\mathbf{E}[T]$. (Hint: You can use what you already know about the probability that the player wins.)

Exercise 12.6: Prove Theorem 12.6. 

Exercise 12.7: In the bin-packing problem, we are given items with sizes $a_1, a_2, \ldots, a_n$ with $0 \le a_i \le 1$ for $1 \le i \le n$. The goal is to pack them into the minimum number of bins, with each bin being able to hold any collection of items whose total sizes sum to at most 1. Suppose that each of the $a_i$ is chosen independently according to some distribution (which might be different for each $i$). Let $P$ be the number of bins required in the best packing of the resulting items. Prove that

\[
\Pr(|P - \mathbf{E}[P]| \ge \lambda) \le 2e^{-2\lambda^2/n}.
\]

Exercise 12.8: Consider an $n$-cube with $N = 2^n$ nodes. Let $S$ be a nonempty set of vertices on the cube, and let $x$ be a vertex chosen uniformly at random among all vertices of the cube. Let $D(x, S)$ be the minimum number of coordinates in which $x$ and $y$ differ over all points $y \in S$. Give a bound on

\[
\Pr(|D(x, S) - \mathbf{E}[D(x, S)]| > \lambda).
\]


Exercise 12.9: In Chapter 4 we developed a tail bound for the sum of $\{0,1\}$ random variables. We can use martingales to generalize this result for the sum of any random variables whose range lies in $[0,1]$. Let $X_1, X_2, \ldots, X_n$ be independent random variables such that $\Pr(0 \le X_i \le 1) = 1$. If $S_n = \sum_{i=1}^{n} X_i$, show that

\[
\Pr(|S_n - \mathbf{E}[S_n]| \ge \lambda) \le 2e^{-2\lambda^2/n}.
\]

Exercise 12.10: A parking-lot attendant has mixed up $n$ keys for $n$ cars. The $n$ car owners arrive together. The attendant gives each owner a key according to a permutation chosen uniformly at random from all permutations. If an owner receives the key to his car, he takes it and leaves; otherwise, he returns the key to the attendant. The attendant now repeats the process with the remaining keys and car owners. This continues until all owners receive the keys to their cars.

Let $R$ be the number of rounds until all car owners receive the keys to their cars. We want to compute $\mathbf{E}[R]$. Let $X_i$ be the number of owners who receive their car keys in the $i$th round. Prove that

\[
Y_i = \sum_{j=1}^{i} \left(X_j - \mathbf{E}[X_j \mid X_1, \ldots, X_{j-1}]\right)
\]

is a martingale. Use the martingale stopping theorem to compute $\mathbf{E}[R]$.


Exercise 12.11: Alice and Bob play each other in a checkers tournament, where the first player to win four games wins the match. The players are evenly matched, so the probability that each player wins each game is 1/2, independent of all other games. The number of minutes for each game is uniformly distributed over the integers in the range [30, 60], again independent of other games. What is the expected time they spend playing the match?

Exercise 12.12: Consider the following extremely inefficient algorithm for sorting $n$ numbers in increasing order. Start by choosing one of the $n$ numbers uniformly at random, and placing it first. Then choose one of the remaining $n - 1$ numbers uniformly at random, and place it second. If the second number is smaller than the first, start over again from the beginning. Otherwise, next choose one of the remaining $n - 2$ numbers uniformly at random, place it third, and so on. The algorithm starts over from the beginning whenever it finds that the $k$th item placed is smaller than the $(k-1)$th item. Determine the expected number of times the algorithm tries to place a number, assuming that the input consists of $n$ distinct numbers.

Exercise 12.13: Suppose that you are arranging a chain of n dominos so that, once 
you are done, you can have them all fall sequentially in a pleasing manner by knocking 
down the lead domino. Each time you try to place a domino in the chain, there is some 
chance that it falls, taking down all of the other dominos you have already carefully 
placed. In that case, you must start all over again from the very first domino. 

(a) Let us call each time you try to place a domino a trial. Each trial succeeds with probability $p$. Using Wald's equation, find the expected number of trials necessary before your arrangement is ready. Calculate this number of trials for $n = 100$ and $p = 0.1$.

(b) Suppose instead that you can break your arrangement into $k$ components, each of size $n/k$, in such a way that once a component is complete, it will not fall when you place further dominos. For example: if you have 10 components of size 10, then once the first component of 10 dominos is placed successfully it will not fall; misplacing a domino later might take down another component, but the first will remain ready. Find the expected number of trials necessary before your arrangement is ready in this case. Calculate the number of trials for $n = 100$, $k = 10$, and $p = 0.1$, and compare with your answer from part (a).

Exercise 12.14: (a) Let $X_1, X_2, \ldots$ be a sequence of independent exponential random variables, each with mean 1. Given a positive real number $k$, let $N$ be defined by

\[
N = \min\left\{ n : \sum_{i=1}^{n} X_i > k \right\}.
\]

That is, $N$ is the smallest number for which the sum of the first $N$ of the $X_i$ is larger than $k$. Use Wald's equation to determine $\mathbf{E}[N]$.

(b) Let $X_1, X_2, \ldots$ be a sequence of independent uniform random variables on the interval $(0,1)$. Given a positive real number $k$ with $0 < k < 1$, let $N$ be defined by

\[
N = \min\left\{ n : \prod_{i=1}^{n} X_i < k \right\}.
\]

That is, $N$ is the smallest number for which the product of the first $N$ of the $X_i$ is smaller than $k$. Determine $\mathbf{E}[N]$. (Hint: You may find Exercise 8.9 helpful.)

Exercise 12.15: A subsequence of a string $s$ is any string that can be obtained by deleting characters from $s$. Consider two strings $x$ and $y$ of length $n$, where each character in each string is independently a 0 with probability 1/2 and a 1 with probability 1/2. We consider the longest common subsequence of the two strings.

(a) Show that the expected length of the longest common subsequence is greater than $c_1 n$ and less than $c_2 n$ for constants $c_1 > 1/2$ and $c_2 < 1$ when $n$ is sufficiently large. (Any constants $c_1$ and $c_2$ will do; as a challenge, you may attempt to find the best constants $c_1$ and $c_2$ that you can.)

(b) Use a martingale inequality to show that the length of the longest common subsequence is highly concentrated around its mean.

Exercise 12.16: Given a bag with $r$ red balls and $g$ green balls, suppose that we uniformly sample $n$ balls from the bag without replacement. Set up an appropriate martingale and use it to show that the number of red balls in the sample is tightly concentrated around $nr/(r + g)$.

Exercise 12.17: We showed in Chapter 5 that the fraction of entries that are 0 in a Bloom filter is concentrated around

\[
\left(1 - \frac{1}{n}\right)^{km},
\]

where $m$ is the number of data items, $k$ is the number of hash functions, and $n$ is the size of the Bloom filter in bits. Derive a similar concentration result using a martingale inequality.

Exercise 12.18: Consider a random graph from $G_{n,N}$, where $N = cn$ for some constant $c > 0$. Let $X$ be the number of isolated vertices (i.e., vertices of degree 0).

(a) Determine $\mathbf{E}[X]$.

(b) Show that

\[
\Pr\left(|X - \mathbf{E}[X]| \ge 2\lambda\sqrt{cn}\right) \le 2e^{-\lambda^2/2}.
\]

(Hint: Use a martingale that reveals the locations of the edges in the graph, one at a time.)

Exercise 12.19: We improve our bound from the Azuma-Hoeffding inequality for the problem where $m$ balls are thrown into $n$ bins. We let $F$ be the number of empty bins after the $m$ balls are thrown and $X_i$ the bin in which the $i$th ball lands. We define $Z_0 = \mathbf{E}[F]$ and $Z_i = \mathbf{E}[F \mid X_1, \ldots, X_i]$.

(a) Let $A_i$ denote the number of bins that are empty after the $i$th ball is thrown. Show that in this case

\[
Z_{i-1} = A_{i-1}\left(1 - \frac{1}{n}\right)^{m-i+1}.
\]

(b) Show that, if the $i$th ball lands in a bin that is empty when it is thrown, then

\[
Z_i = (A_{i-1} - 1)\left(1 - \frac{1}{n}\right)^{m-i}.
\]

(c) Show that, if the $i$th ball lands in a bin that is not empty when it is thrown, then

\[
Z_i = A_{i-1}\left(1 - \frac{1}{n}\right)^{m-i}.
\]

(d) Show that the Azuma-Hoeffding inequality of Theorem 12.6 applies with $d_i = (1 - 1/n)^{m-i}$.

(e) Using part (d), prove that

\[
\Pr(|F - \mathbf{E}[F]| \ge \lambda) \le 2e^{-\lambda^2 (2n-1)/(n^2 - (\mathbf{E}[F])^2)}.
\]


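Exercise 12.20: Let $f(X_1, X_2, \ldots, X_n)$ satisfy the Lipschitz condition so that, for any $i$ and any values $x_1, \ldots, x_n$ and $y_i$,

\[
|f(x_1, x_2, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) - f(x_1, x_2, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_n)| \le c.
\]

We set

\[
Z_0 = \mathbf{E}[f(X_1, X_2, \ldots, X_n)]
\]

and

\[
Z_i = \mathbf{E}[f(X_1, X_2, \ldots, X_n) \mid X_1, X_2, \ldots, X_i].
\]

Give an example to show that, if the $X_i$ are not independent, then it is possible that $|Z_i - Z_{i-1}| > c$ for some $i$.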








CHAPTER THIRTEEN 


Pairwise Independence and 
Universal Hash Functions 


In this chapter we introduce and apply a limited notion of independence, known as $k$-wise independence, focusing in particular on the important case of pairwise independence. Applying limited dependence can allow us to reduce the amount of randomness used by a randomized algorithm, in some cases enabling us to convert a randomized algorithm into an efficient deterministic one. Limited dependence is also used in the design of universal and strongly universal families of hash functions, giving space- and time-efficient data structures. We consider why universal hash functions are effective in practice and show how they lead to simple perfect hash schemes. Finally, we apply these ideas to the design of effective and practical approximation algorithms for finding frequent objects in data streams, generalizing the Bloom filter data structure introduced in Chapter 5.


13.1. Pairwise Independence 


Recall that in Chapter 2 we defined a set of events $E_1, E_2, \ldots, E_n$ to be mutually independent if, for any subset $I \subseteq [1, n]$,

\[
\Pr\left(\bigcap_{i \in I} E_i\right) = \prod_{i \in I} \Pr(E_i).
\]

Similarly, we defined a set of random variables $X_1, X_2, \ldots, X_n$ to be mutually independent if, for any subset $I \subseteq [1, n]$ and any values $x_i$, $i \in I$,

\[
\Pr\left(\bigcap_{i \in I} (X_i = x_i)\right) = \prod_{i \in I} \Pr(X_i = x_i).
\]

ief 


Mutual independence is often too much to ask for. Here, we examine a more limited notion of independence that proves useful in many contexts: $k$-wise independence.

Definition 13.1:

1. A set of events $E_1, E_2, \ldots, E_n$ is $k$-wise independent if, for any subset $I \subseteq [1, n]$ with $|I| \le k$,

\[
\Pr\left(\bigcap_{i \in I} E_i\right) = \prod_{i \in I} \Pr(E_i).
\]

2. A set of random variables $X_1, X_2, \ldots, X_n$ is $k$-wise independent if, for any subset $I \subseteq [1, n]$ with $|I| \le k$ and for any values $x_i$, $i \in I$,

\[
\Pr\left(\bigcap_{i \in I} (X_i = x_i)\right) = \prod_{i \in I} \Pr(X_i = x_i).
\]

3. The random variables $X_1, X_2, \ldots, X_n$ are said to be pairwise independent if they are 2-wise independent. That is, for any pair $i, j$ with $i \ne j$ and any values $a, b$,

\[
\Pr((X_i = a) \cap (X_j = b)) = \Pr(X_i = a)\Pr(X_j = b).
\]

13.1.1. Example: A Construction of Pairwise Independent Bits 

A random bit is uniform if it assumes the values 0 and 1 with equal probability. Here we show how to derive $m = 2^b - 1$ uniform pairwise independent bits from $b$ independent, uniform random bits $X_1, \ldots, X_b$.

Enumerate the $2^b - 1$ nonempty subsets of $\{1, 2, \ldots, b\}$ in some order, and let $S_j$ be the $j$th subset in this ordering. Set

\[
Y_j = \bigoplus_{i \in S_j} X_i,
\]

where $\oplus$ is the exclusive-or operation. Equivalently, we could write this as

\[
Y_j = \sum_{i \in S_j} X_i \bmod 2.
\]

Lemma 13.1: The $Y_j$ are pairwise independent uniform bits.

Proof: We first show that, for any nonempty set $S_j$, the random bit

\[
Y_j = \bigoplus_{i \in S_j} X_i
\]

is uniform. This follows easily using the principle of deferred decisions (see Section 1.3). Let $z$ be the largest element of $S_j$. Then

\[
Y_j = \left(\bigoplus_{i \in S_j - \{z\}} X_i\right) \oplus X_z.
\]

Suppose we reveal the values for $X_i$ for all $i \in S_j - \{z\}$. Then it is clear that the value of $X_z$ determines the value of $Y_j$ and that $Y_j$ will take on the values 0 and 1 with equal probability.

Now consider any two variables $Y_k$ and $Y_\ell$ with their corresponding sets $S_k$ and $S_\ell$. Because the two sets are distinct, one of them contains an element that the other lacks; exchanging the roles of $k$ and $\ell$ if necessary, let $z$ be an element of $S_\ell$ that is not in $S_k$ and consider, for any values $c, d \in \{0, 1\}$,

\[
\Pr(Y_\ell = d \mid Y_k = c).
\]

We claim, again by the principle of deferred decisions, that this probability is 1/2. For suppose that we reveal the values for $X_i$ for all $i$ in $(S_k \cup S_\ell) - \{z\}$. Even though this determines the value of $Y_k$, the value of $X_z$ will determine $Y_\ell$. The conditioning on the value of $Y_k$ therefore does not change that $Y_\ell$ is equally likely to be 0 or 1. Hence

\[
\Pr((Y_k = c) \cap (Y_\ell = d)) = \Pr(Y_\ell = d \mid Y_k = c)\Pr(Y_k = c) = 1/4.
\]

Since this holds for any values of $c, d \in \{0, 1\}$, we have proven pairwise independence.
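
The construction is short enough to check exhaustively. In the sketch below (an illustration with names of our choosing) the nonempty subsets are encoded as the bitmasks $1, \ldots, 2^b - 1$, which is one possible ordering of the subsets; the check enumerates all $2^b$ seeds and confirms that every pair of output bits takes each of the four value combinations equally often.

```python
from itertools import combinations, product


def pairwise_bits(seed_bits):
    """Return Y_1..Y_{2^b - 1}: Y_j is the XOR of the seed bits selected by mask j."""
    b = len(seed_bits)
    ys = []
    for mask in range(1, 2 ** b):  # each mask encodes one nonempty subset
        y = 0
        for i in range(b):
            if (mask >> i) & 1:
                y ^= seed_bits[i]
        ys.append(y)
    return ys


def is_pairwise_independent(b):
    """Check Pr(Y_j = c, Y_l = d) = 1/4 for all j != l over all 2^b seeds."""
    tables = [pairwise_bits(seed) for seed in product((0, 1), repeat=b)]
    total = len(tables)
    for j, l in combinations(range(2 ** b - 1), 2):
        counts = {}
        for ys in tables:
            key = (ys[j], ys[l])
            counts[key] = counts.get(key, 0) + 1
        if any(4 * counts.get(pair, 0) != total for pair in product((0, 1), repeat=2)):
            return False
    return True


if __name__ == "__main__":
    print(is_pairwise_independent(3))
    # The bits are not mutually independent: with b = 2, Y_3 = Y_1 XOR Y_2 always.
    print(all(ys[0] ^ ys[1] == ys[2]
              for ys in (pairwise_bits(s) for s in product((0, 1), repeat=2))))
```

The second check illustrates why the construction yields only pairwise, not mutual, independence: some output bits are determined by others.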


13.1.2. Application: Derandomizing an Algorithm for Large Cuts

In Chapter 6, we examined a simple randomized algorithm for finding a large cut in an undirected graph $G = (V, E)$: the algorithm places each vertex on one side of the cut independently with probability 1/2. The expected value of a cut generated this way is $m/2$, where $m$ is the number of edges in the graph. We also showed (in Section 6.3) that this algorithm could be derandomized effectively using conditional expectations.

Here we present another way to derandomize this algorithm, using pairwise independence. This argument exemplifies the approach of derandomization using $k$-wise independence.

Suppose that we have a collection $Y_1, Y_2, \ldots, Y_n$ of pairwise independent bits, where $n = |V|$ is the number of vertices in the graph. We define our cut by putting all vertices $i$ with $Y_i = 0$ on one side of the cut and all vertices $i$ with $Y_i = 1$ on the other side of the cut. We show that, in this case, the expected number of edges that crosses the cut remains $m/2$. That is, we do not require complete independence to analyze the expectation; pairwise independence suffices.

Recall the argument of Section 6.2.1: number the edges from 1 to $m$, and let $Z_i = 1$ if the $i$th edge crosses the cut and $Z_i = 0$ otherwise. Then $Z = \sum_{i=1}^{m} Z_i$ is the number of edges crossing the cut, and

\[
\mathbf{E}[Z] = \mathbf{E}\left[\sum_{i=1}^{m} Z_i\right] = \sum_{i=1}^{m} \mathbf{E}[Z_i].
\]

Let $a$ and $b$ be the two vertices adjacent to the $i$th edge. Then

\[
\Pr(Z_i = 1) = \Pr(Y_a \ne Y_b) = 1/2,
\]

where we have used the pairwise independence of $Y_a$ and $Y_b$. Hence $\mathbf{E}[Z_i] = 1/2$, and it follows that $\mathbf{E}[Z] = m/2$.

Now let our $n$ pairwise independent bits $Y_1, \ldots, Y_n$ be generated from $b$ independent, uniform random bits $X_1, X_2, \ldots, X_b$ in the manner of Lemma 13.1 (here $b = \lceil \log_2(n + 1) \rceil$). Then $\mathbf{E}[Z] = m/2$ for the resulting cut, where the sample space is just all the possible choices for the initial $b$ random bits. By the probabilistic method (specifically, Lemma 6.2), there is some setting of the $b$ bits that gives a cut with value at least $m/2$. We can try all possible $2^b$ settings for the bits to find such a cut. Since $2^b$ is $O(n)$ and since, for each cut, the number of crossing edges can easily be calculated in $O(m)$ time, it follows that we can find a cut with at least $m/2$ crossing edges deterministically in $O(mn)$ time.

Although this approach does not appear to be as efficient as the derandomization of
Section 6.3, one redeeming feature of the scheme is that it is trivial to parallelize. If
we have sufficiently many processors available, then each of the $\Omega(n)$ possibilities for
the random bits $X_1, X_2, \ldots, X_b$ can be assigned to a single processor, with each
possibility giving a cut. The parallelization reduces the running time by a factor of $\Omega(n)$
using $O(n)$ processors. In fact, using $O(mn)$ processors, we can assign a processor for
each combination of a specific edge with a specific sequence of random bits and then
determine, in constant time, whether the edge crosses the cut for that setting of the
random bits. After that, only $O(\log n)$ time is necessary to collect the results and find the
large cut.
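
The exhaustive search over seeds described above can be sketched in a few lines of Python. This is a minimal illustration, not the text's own code: it assumes that Lemma 13.1 is the XOR-of-subsets construction (bit $Y_j$ is the parity of the seed bits selected by the binary representation of $j$), and the names `pairwise_bits` and `derandomized_cut` are hypothetical.

```python
def pairwise_bits(seed, n):
    """Return Y_1..Y_n, where Y_j is the parity of (j AND seed).

    Assumption: this mirrors the XOR construction of Lemma 13.1, with the
    nonempty subset for Y_j given by the set bits of j.
    """
    return [bin(j & seed).count("1") % 2 for j in range(1, n + 1)]


def derandomized_cut(n, edges):
    """Try every seed and return (best cut value, side of each vertex).

    Vertices are 0..n-1 and edges is a list of (u, v) pairs with u != v.
    """
    b = n.bit_length()  # equals ceil(log2(n + 1)) for n >= 1
    best_value, best_sides = -1, None
    for seed in range(2 ** b):  # O(n) seeds in total
        sides = pairwise_bits(seed, n)
        value = sum(1 for u, v in edges if sides[u] != sides[v])  # O(m) work
        if value > best_value:
            best_value, best_sides = value, sides
    return best_value, best_sides
```

Because the average cut value over all seeds is $m/2$, the best seed is guaranteed to give a cut with at least $m/2$ crossing edges. Each iteration of the outer loop is independent of the others, which is what makes the scheme easy to parallelize.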

13.1.3. Example: Constructing Pairwise Independent Values Modulo a Prime

We consider another construction that provides pairwise independent values
$Y_0, Y_1, \ldots, Y_{p-1}$ that are uniform over the values $\{0, 1, \ldots, p-1\}$ for a prime $p$. Our
construction requires only two independent, uniform values $X_1$ and $X_2$ over
$\{0, 1, \ldots, p-1\}$, from which we derive

$$Y_i = X_1 + iX_2 \bmod p \quad \text{for } i = 0, \ldots, p-1.$$

Lemma 13.2: The variables $Y_0, Y_1, \ldots, Y_{p-1}$ are pairwise independent uniform random
variables over $\{0, 1, \ldots, p-1\}$.


Proof: It is clear that each $Y_i$ is uniform over $\{0, 1, \ldots, p-1\}$, again by applying the
principle of deferred decisions. Given $X_2$, the $p$ distinct possible values for $X_1$ give $p$
distinct possible values for $Y_i$ modulo $p$, each of which is equally likely.

Now consider any two variables $Y_i$ and $Y_j$ with $i \ne j$. We wish to show that, for any
$a, b \in \{0, 1, \ldots, p-1\}$,

$$\Pr((Y_i = a) \cap (Y_j = b)) = \frac{1}{p^2},$$

which implies pairwise independence. The event $Y_i = a$ and $Y_j = b$ is equivalent to

$$X_1 + iX_2 = a \bmod p \quad \text{and} \quad X_1 + jX_2 = b \bmod p.$$

This is a system of two equations and two unknowns with just one solution:

$$X_2 = \frac{b - a}{j - i} \bmod p \quad \text{and} \quad X_1 = a - \frac{i(b - a)}{j - i} \bmod p.$$

Since $X_1$ and $X_2$ are independent and uniform over $\{0, 1, \ldots, p-1\}$, the result follows.

■ 
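
For a small prime, Lemma 13.2 can be checked by brute force. The following Python sketch (the function names are illustrative and not from the text) enumerates all $p^2$ choices of $(X_1, X_2)$ and confirms that every value pair $(a, b)$ is hit exactly once by $(Y_i, Y_j)$, which is the statement $\Pr((Y_i = a) \cap (Y_j = b)) = 1/p^2$.

```python
from collections import Counter
from itertools import product


def joint_counts(p, i, j):
    """Count how often (Y_i, Y_j) takes each value pair over all (X1, X2)."""
    counts = Counter()
    for x1, x2 in product(range(p), repeat=2):
        counts[((x1 + i * x2) % p, (x1 + j * x2) % p)] += 1
    return counts


def is_pairwise_independent(p):
    """True if every pair (Y_i, Y_j), i != j, is uniform over all p^2 value pairs."""
    for i in range(p):
        for j in range(i + 1, p):
            counts = joint_counts(p, i, j)
            if len(counts) != p * p or set(counts.values()) != {1}:
                return False
    return True
```

Running the same check with a composite modulus such as 6 fails, because $j - i$ need not be invertible; this is where the proof uses the primality of $p$.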

This proof can be extended to the following useful result: given $2n$ independent, uniform
random bits, one can construct up to $2^n$ pairwise independent and uniform strings
of $n$ bits. The extension requires knowledge of finite fields, so we only sketch the result
here. The setup and proof are exactly the same as for Lemma 13.2 except that, instead
of working modulo $p$, we perform all arithmetic in a fixed finite field with $2^n$ elements
(such as the field $GF(2^n)$ of all polynomials with coefficients in $GF(2)$ modulo some
irreducible polynomial of degree $n$). That is, we assume a fixed one-to-one mapping
$f$ from strings of $n$ bits, which can also be thought of as numbers in $\{0, 1, \ldots, 2^n - 1\}$,
to field elements. We let

$$Y_i = f^{-1}\big(f(X_1) + f(i) \cdot f(X_2)\big),$$

where $X_1$ and $X_2$ are chosen independently and uniformly over $\{0, 1, \ldots, 2^n - 1\}$, $i$ runs
over the values $\{0, 1, \ldots, 2^n - 1\}$, and the addition and multiplication are performed
over the field. The $Y_i$ are then pairwise independent.

13.2. Chebyshev’s Inequality for Pairwise Independent Variables 


Pairwise independence is much weaker than mutual independence. For example, we
can use Chernoff bounds to evaluate the tail distribution of a sum of independent random
variables, but we cannot directly apply a Chernoff bound if the $X_i$ are only pairwise
independent. However, pairwise independence is strong enough to allow for easy
calculation of the variance of the sum, which allows for a useful application of
Chebyshev's inequality.


Theorem 13.3: Let $X = \sum_{i=1}^{n} X_i$, where the $X_i$ are pairwise independent random
variables. Then

$$\mathbf{Var}[X] = \sum_{i=1}^{n} \mathbf{Var}[X_i].$$


Proof: We saw in Chapter 3 that

$$\mathbf{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathbf{Var}[X_i] + 2\sum_{i<j} \mathbf{Cov}(X_i, X_j),$$

where

$$\mathbf{Cov}(X_i, X_j) = \mathbf{E}[(X_i - \mathbf{E}[X_i])(X_j - \mathbf{E}[X_j])] = \mathbf{E}[X_i X_j] - \mathbf{E}[X_i]\mathbf{E}[X_j].$$

Since $X_1, X_2, \ldots, X_n$ are pairwise independent, it is clear (by the same argument as
in Theorem 3.3) that for any $i \ne j$ we have

$$\mathbf{E}[X_i X_j] - \mathbf{E}[X_i]\mathbf{E}[X_j] = 0.$$

Therefore,

$$\mathbf{Var}[X] = \sum_{i=1}^{n} \mathbf{Var}[X_i]. \qquad ■$$


Applying Chebyshev's inequality to the sum of pairwise independent variables yields
the following.

Corollary 13.4: Let $X = \sum_{i=1}^{n} X_i$, where the $X_i$ are pairwise independent random
variables. Then

$$\Pr(|X - \mathbf{E}[X]| \ge a) \le \frac{\mathbf{Var}[X]}{a^2} = \frac{\sum_{i=1}^{n} \mathbf{Var}[X_i]}{a^2}.$$


13.2.1. Application: Sampling Using Fewer Random Bits 

We apply Chebyshev's inequality for pairwise independent random variables to obtain
a good approximation through sampling. This uses less randomness than the natural
approach based on Chernoff bounds.

Suppose that we have a function $f : \{0,1\}^n \to [0,1]$ mapping $n$-bit vectors into real
numbers. Let $\bar{f} = \big(\sum_{x \in \{0,1\}^n} f(x)\big)/2^n$ be the average value of $f$. We want to compute
a $1 - \delta$ confidence interval for $\bar{f}$. That is, we wish to find an interval
$[\tilde{f} - \epsilon, \tilde{f} + \epsilon]$ such that

$$\Pr\big(\bar{f} \in [\tilde{f} - \epsilon, \tilde{f} + \epsilon]\big) \ge 1 - \delta.$$

As a concrete example, suppose that we have an integrable function $g : [0,1] \to [0,1]$
and that the derivative of $g$ exists with $|g'(x)| \le C$ for some fixed constant $C$
over the entire interval $(0, 1)$. We are interested in $\int_{x=0}^{1} g(x)\,dx$. There may be no
direct way to compute this integral exactly, but through sampling we can obtain a good
estimate. If $X$ is a uniform random variable on $[0,1]$, then $\mathbf{E}[g(X)] = \int_{x=0}^{1} g(x)\,dx$
by the definition of the expectation of a continuous random variable. By taking the
average of multiple independent samples, we can approximate the integral. If our source
of randomness generates only random bits instead of random real numbers, then we
might approximate the integral as follows. For a string of bits $x \in \{0,1\}^n$, we may
interpret $x$ as a real number $\tilde{x} \in [0,1]$ by considering it as a decimal in binary: for example,
11001 would correspond to $0.11001 = 25/32$. Let $f(x)$ denote the value of the function
$g$ at the decimal value $\tilde{x}$. Then, for any integer $i$ with $0 \le i \le 2^n - 1$, for
$y \in [i/2^n, (i+1)/2^n)$ we have

$$g\!\left(\frac{i}{2^n}\right) - \frac{C}{2^n} \le g(y) \le g\!\left(\frac{i}{2^n}\right) + \frac{C}{2^n}.$$

It follows that

$$\frac{1}{2^n}\sum_{x \in \{0,1\}^n}\left(f(x) - \frac{C}{2^n}\right) \le \int_{x=0}^{1} g(x)\,dx \le \frac{1}{2^n}\sum_{x \in \{0,1\}^n}\left(f(x) + \frac{C}{2^n}\right).$$

By taking $n$ sufficiently large, we can guarantee that $\bar{f} = \big(\sum_{x \in \{0,1\}^n} f(x)\big)/2^n$
differs from the integral of $g$ by at most a constant $\gamma$. In this case, a confidence interval
$[\tilde{f} - \epsilon, \tilde{f} + \epsilon]$ for $\bar{f}$ would yield a confidence interval
$[\tilde{f} - \epsilon - \gamma, \tilde{f} + \epsilon + \gamma]$ for the integral of $g$.

We could handle the problem of finding a confidence interval for the average value
$\bar{f}$ by using independent samples and applying a Chernoff bound. That is, suppose that
we sample uniformly with replacement random points in $\{0,1\}^n$, evaluate $f$ at all of
these points, and take the average of our samples. This is similar to the parameter
estimation of Section 4.2.3. Theorem 13.5 is an immediate consequence of the following
Chernoff bound, which can be derived using Exercises 4.13 and 4.19. If $Z_1, Z_2, \ldots, Z_m$
are independent, identically distributed, real-valued random variables with mean $\mu$ that
take on one of a finite possible set of values in the range $[0, 1]$, then

$$\Pr\left(\left|\sum_{i=1}^{m} Z_i - m\mu\right| \ge \epsilon m\right) \le 2e^{-2m\epsilon^2}.$$

Theorem 13.5: Let $f : \{0,1\}^n \to [0,1]$ and $\bar{f} = \big(\sum_{x \in \{0,1\}^n} f(x)\big)/2^n$. Let
$X_1, \ldots, X_m$ be chosen independently and uniformly at random from $\{0,1\}^n$. If
$m > \ln(2/\delta)/2\epsilon^2$, then

$$\Pr\left(\left|\frac{1}{m}\sum_{i=1}^{m} f(X_i) - \bar{f}\right| \ge \epsilon\right) \le \delta.$$


Although the exact choice of $m$ depends on the Chernoff bound used, in general
this straightforward approach requires $\Theta(\ln(1/\delta)/\epsilon^2)$ samples to achieve the desired
bounds.

A possible problem with this approach is that it requires a large number of random
bits to be available. Each sample of $f$ requires $n$ independent bits, so applying
Theorem 13.5 means that we need at least $\Omega(n \ln(1/\delta)/\epsilon^2)$ independent, uniform random
bits to obtain an approximation that has additive error at most $\epsilon$ with probability at least
$1 - \delta$.

A related problem arises when we need to record how the samples were obtained, so
that the work can be reproduced and verified at a later time. In this case, we also need
to store the random bits used for archival purposes, and using fewer random
bits would lessen the storage requirements.

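We can use pairwise independent samples to obtain a similar approximation using
less randomness. Let $X_1, \ldots, X_m$ be pairwise independent points chosen from $\{0,1\}^n$,
and let $Y = \big(\sum_{i=1}^{m} f(X_i)\big)/m$. Then $\mathbf{E}[Y] = \bar{f}$, and we can apply Chebyshev's
inequality to obtain

$$\Pr(|Y - \bar{f}| \ge \epsilon) \le \frac{\mathbf{Var}[Y]}{\epsilon^2}
= \frac{\mathbf{Var}\big[\big(\sum_{i=1}^{m} f(X_i)\big)/m\big]}{\epsilon^2}
= \frac{\sum_{i=1}^{m} \mathbf{Var}[f(X_i)]}{m^2\epsilon^2}
\le \frac{m}{m^2\epsilon^2} = \frac{1}{m\epsilon^2},$$

since $\mathbf{Var}[f(X_i)] \le \mathbf{E}[(f(X_i))^2] \le 1$. We therefore find $\Pr(|Y - \bar{f}| \ge \epsilon) \le \delta$ when
$m = 1/\delta\epsilon^2$. (In fact, one can prove that $\mathbf{Var}[f(X_i)] \le 1/4$, giving a slightly better
bound; this is left as Exercise 13.4.)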

Using pairwise independent samples requires more samples: $\Theta(1/\delta\epsilon^2)$ instead of
the $\Theta(\ln(1/\delta)/\epsilon^2)$ samples when they are independent. But recall from Section 13.1.3
that we can obtain up to $2^n$ pairwise independent samples with just $2n$ uniform
independent bits. Hence, as long as $1/\delta\epsilon^2 < 2^n$, just $2n$ random bits suffice; this is much
less than the number required when using completely independent samples. Usually $\epsilon$
and $\delta$ are fixed constants independent of $n$, and this type of estimation is quite efficient
in terms of both the number of random bits used and the computational cost.
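
The following Python sketch illustrates the pairwise independent estimator. To stay short it makes a simplifying assumption that is not in the text: sample points are drawn from $\{0, 1, \ldots, p-1\}$ for a prime $p$ using the construction of Lemma 13.2, rather than from $\{0,1\}^n$ using arithmetic in $GF(2^n)$. All function names are hypothetical.

```python
def pairwise_points(x1, x2, m, p):
    """First m pairwise independent points Y_i = (x1 + i*x2) mod p; needs m <= p."""
    return [(x1 + i * x2) % p for i in range(m)]


def estimate(f, x1, x2, m, p):
    """Average of f over the m sample points generated from the seed (x1, x2)."""
    return sum(f(y) for y in pairwise_points(x1, x2, m, p)) / m


def failure_fraction(f, m, p, eps):
    """Exact fraction of seeds (x1, x2) whose estimate is off by at least eps.

    Enumerating all p*p seeds is only feasible for small p; it is done here
    to compare the true failure probability with the Chebyshev bound.
    """
    true_avg = sum(f(y) for y in range(p)) / p
    bad = sum(
        1
        for x1 in range(p)
        for x2 in range(p)
        if abs(estimate(f, x1, x2, m, p) - true_avg) >= eps
    )
    return bad / (p * p)
```

In actual use one would draw a single seed $(X_1, X_2)$ uniformly at random and call `estimate` once; only that seed needs to be stored to reproduce the samples. For any $f$ with values in $[0,1]$, `failure_fraction` is at most $1/(m\epsilon^2)$ by the bound derived above.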


13.3. Families of Universal Hash Functions 


Up to this point, when studying hash functions we modeled them as being completely
random in the sense that, for any collection of items $x_1, x_2, \ldots, x_k$, the hash values
$h(x_1), h(x_2), \ldots, h(x_k)$ were considered uniform and independent over the range of
the hash function. This was the framework we used to analyze hashing as a balls-and-bins
problem in Chapter 5. The assumption of a completely random hash function
simplifies the analysis for a theoretical study of hashing. In practice, however,
completely random hash functions are too expensive to compute and store, so the model
does not fully reflect reality.

Two approaches are commonly used to implement practical hash functions. In many
cases, heuristic or ad hoc functions designed to appear random are used. Although these
functions may work suitably for some applications, they generally do not have any
associated provable guarantees, making their use potentially risky. Another approach is
to use hash functions for which there are some provable guarantees. We trade away the
strong statements one can make about completely random hash functions for weaker
statements with hash functions that are efficient to store and compute.

We consider one of the computationally simplest classes of hash functions that provide
useful provable performance guarantees: universal families of hash functions.
These functions are widely used in practice.


Definition 13.2: Let $U$ be a universe with $|U| \ge n$ and let $V = \{0, 1, \ldots, n-1\}$. A
family of hash functions $\mathcal{H}$ from $U$ to $V$ is said to be $k$-universal if, for any distinct
elements $x_1, x_2, \ldots, x_k$ and for a hash function $h$ chosen uniformly at random from $\mathcal{H}$,
we have

$$\Pr(h(x_1) = h(x_2) = \cdots = h(x_k)) \le \frac{1}{n^{k-1}}.$$

A family of hash functions $\mathcal{H}$ from $U$ to $V$ is said to be strongly $k$-universal if, for any
distinct elements $x_1, x_2, \ldots, x_k$, any values $y_1, y_2, \ldots, y_k \in \{0, 1, \ldots, n-1\}$, and a hash
function $h$ chosen uniformly at random from $\mathcal{H}$, we have

$$\Pr((h(x_1) = y_1) \cap (h(x_2) = y_2) \cap \cdots \cap (h(x_k) = y_k)) = \frac{1}{n^k}.$$

We will primarily be interested in 2-universal and strongly 2-universal families of hash
functions. When we choose a hash function from a family of 2-universal hash functions,
the probability that any two elements $x_1$ and $x_2$ have the same hash value is at most $1/n$.
In this respect, a hash function chosen from a 2-universal family acts like a random
hash function. It does not follow, however, that for 2-universal families the probability
of any three values $x_1$, $x_2$, and $x_3$ having the same hash value is at most $1/n^2$, as would
be the case if the hash values of $x_1$, $x_2$, and $x_3$ were mutually independent.

When a family is strongly 2-universal and we choose a hash function from that family,
the values $h(x_1)$ and $h(x_2)$ are pairwise independent, since the probability that they
take on any specific pair of values is $1/n^2$. Because of this, hash functions chosen from
a strongly 2-universal family are also known as pairwise independent hash functions.
More generally, if a family is strongly $k$-universal and we choose a hash function from
that family, then the values $h(x_1), h(x_2), \ldots, h(x_k)$ are $k$-wise independent. Notice that
a strongly $k$-universal hash function is also $k$-universal.

To gain some insight into the behavior of universal families of hash functions, let us
revisit a problem we considered in the balls-and-bins framework of Chapter 5. We saw
in Section 5.2 that, when $n$ items are hashed into $n$ bins by a completely random hash
function, the maximum load is $\Theta(\log n/\log\log n)$ with high probability. We now
consider what bounds can be obtained on the maximum load when $n$ items are hashed into
$n$ bins using a hash function chosen from a 2-universal family.

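First, consider the more general case where we have $m$ items labeled $x_1, x_2, \ldots, x_m$.
For $1 \le i < j \le m$, let $X_{ij} = 1$ if items $x_i$ and $x_j$ land in the same bin. Let
$X = \sum_{1 \le i < j \le m} X_{ij}$ be the total number of collisions between pairs of items. By the
linearity of expectations,

$$\mathbf{E}[X] = \mathbf{E}\left[\sum_{1 \le i < j \le m} X_{ij}\right] = \sum_{1 \le i < j \le m} \mathbf{E}[X_{ij}].$$

Since our hash function is chosen from a 2-universal family, it follows that

$$\mathbf{E}[X_{ij}] = \Pr(h(x_i) = h(x_j)) \le \frac{1}{n}$$

and hence

$$\mathbf{E}[X] \le \binom{m}{2}\frac{1}{n} < \frac{m^2}{2n}. \qquad (13.1)$$

Markov's inequality then yields

$$\Pr\left(X \ge \frac{m^2}{n}\right) \le \Pr(X \ge 2\mathbf{E}[X]) \le \frac{1}{2}.$$

If we now suppose that the maximum number of items in a bin is $Y$, then the number
of collisions $X$ must be at least $\binom{Y}{2}$. Therefore,

$$\Pr\left(\binom{Y}{2} \ge \frac{m^2}{n}\right) \le \Pr\left(X \ge \frac{m^2}{n}\right) \le \frac{1}{2},$$

which, because $\binom{Y}{2} \ge (Y-1)^2/2$, implies that

$$\Pr\left(Y - 1 \ge m\sqrt{2/n}\right) \le \frac{1}{2}.$$

In particular, in the case where $m = n$, the maximum load is at most $1 + \sqrt{2n}$ with
probability at least 1/2.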

This result is much weaker than the one for perfectly random hash functions, but it 
is extremely general in that it holds for any 2-universal family of hash functions. The 
result will prove useful for designing perfect hash functions, as we describe in Sec¬ 
tion 13.3.3. 

13.3.1. Example: A 2-Universal Family of Hash Functions

Let the universe $U$ be the set $\{0, 1, 2, \ldots, m-1\}$ and let the range of our hash function
be $V = \{0, 1, 2, \ldots, n-1\}$, with $m \ge n$. Consider the family of hash functions
obtained by choosing a prime $p \ge m$, letting

$$h_{a,b}(x) = ((ax + b) \bmod p) \bmod n,$$

and then taking the family

$$\mathcal{H} = \{h_{a,b} \mid 1 \le a \le p-1,\ 0 \le b \le p-1\}.$$

Notice that $a$ cannot here take on the value 0.

Lemma 13.6: $\mathcal{H}$ is 2-universal.


Proof: We count the number of functions in $\mathcal{H}$ for which two distinct elements $x_1$ and
$x_2$ from $U$ collide.

First we note that, for any $x_1 \ne x_2$,

$$ax_1 + b \ne ax_2 + b \bmod p.$$

This follows because $ax_1 + b = ax_2 + b \bmod p$ implies that $a(x_1 - x_2) = 0 \bmod p$,
yet here both $a$ and $(x_1 - x_2)$ are nonzero modulo $p$.

In fact, for every pair of values $(u, v)$ such that $u \ne v$ and $0 \le u, v \le p-1$, there
exists exactly one pair of values $(a, b)$ for which $ax_1 + b = u \bmod p$ and
$ax_2 + b = v \bmod p$. This pair of equations has two unknowns, and its unique solution is given by:

$$a = \frac{v - u}{x_2 - x_1} \bmod p,$$
$$b = u - ax_1 \bmod p.$$

Since there is exactly one hash function for each pair $(a, b)$, it follows that there is
exactly one hash function in $\mathcal{H}$ for which

$$ax_1 + b = u \bmod p \quad \text{and} \quad ax_2 + b = v \bmod p.$$

Therefore, in order to bound the probability that $h_{a,b}(x_1) = h_{a,b}(x_2)$ when $h_{a,b}$ is
chosen uniformly at random from $\mathcal{H}$, it suffices to count the number of pairs $(u, v)$,
$0 \le u, v \le p-1$, for which $u \ne v$ but $u = v \bmod n$. For each choice of $u$ there are
at most $\lceil p/n \rceil - 1$ possible appropriate values for $v$, giving at most
$p(\lceil p/n \rceil - 1) \le p(p-1)/n$ pairs. Each pair corresponds to one of $p(p-1)$ hash functions, so

$$\Pr(h_{a,b}(x_1) = h_{a,b}(x_2)) \le \frac{p(p-1)/n}{p(p-1)} = \frac{1}{n},$$

proving that $\mathcal{H}$ is 2-universal. ■
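
The counting argument in this proof can be checked exhaustively for small parameters. The Python sketch below (with illustrative function names that do not come from the text) computes the exact collision probability of a pair of keys over all $p(p-1)$ choices of $(a, b)$, using exact rational arithmetic.

```python
from fractions import Fraction


def h(a, b, p, n, x):
    """The hash function h_{a,b}(x) = ((a*x + b) mod p) mod n."""
    return ((a * x + b) % p) % n


def collision_probability(x1, x2, p, n):
    """Exact Pr(h(x1) == h(x2)) over 1 <= a <= p-1 and 0 <= b <= p-1."""
    hits = sum(
        1
        for a in range(1, p)
        for b in range(p)
        if h(a, b, p, n, x1) == h(a, b, p, n, x2)
    )
    return Fraction(hits, p * (p - 1))


def max_collision_probability(m, p, n):
    """Largest collision probability over all pairs of distinct keys in 0..m-1."""
    return max(
        collision_probability(x1, x2, p, n)
        for x1 in range(m)
        for x2 in range(x1 + 1, m)
    )
```

For any prime $p \ge m$, Lemma 13.6 says that `max_collision_probability(m, p, n)` is at most $1/n$.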

13.3.2. Example: A Strongly 2-Universal Family of Hash Functions 

We can apply ideas similar to those used to construct the 2-universal family of hash
functions in Lemma 13.6 to construct strongly 2-universal families of hash functions.
To start, suppose that both our universe $U$ and the range $V$ of the hash function are
$\{0, 1, 2, \ldots, p-1\}$ for some prime $p$. Now let

$$h_{a,b}(x) = (ax + b) \bmod p,$$

and consider the family

$$\mathcal{H} = \{h_{a,b} \mid 0 \le a, b \le p-1\}.$$

Notice that here $a$ can take on the value 0, in contrast with the family of hash functions
used in Lemma 13.6.

Lemma 13.7: $\mathcal{H}$ is strongly 2-universal.

Proof: This is entirely similar to the proof of Lemma 13.2. For any two distinct elements $x_1$
and $x_2$ in $U$ and any two values $y_1$ and $y_2$ in $V$, we need to show that

$$\Pr((h_{a,b}(x_1) = y_1) \cap (h_{a,b}(x_2) = y_2)) = \frac{1}{p^2}.$$

The condition that both $h_{a,b}(x_1) = y_1$ and $h_{a,b}(x_2) = y_2$ yields two equations modulo
$p$ with two unknowns, the values for $a$ and $b$: $ax_1 + b = y_1 \bmod p$ and
$ax_2 + b = y_2 \bmod p$. This system of two equations and two unknowns has just one solution:

$$a = \frac{y_2 - y_1}{x_2 - x_1} \bmod p,$$
$$b = y_1 - ax_1 \bmod p.$$

Hence only one choice of the pair $(a, b)$ out of the $p^2$ possibilities results in $x_1$ and $x_2$
hashing to $y_1$ and $y_2$, proving that

$$\Pr((h_{a,b}(x_1) = y_1) \cap (h_{a,b}(x_2) = y_2)) = \frac{1}{p^2},$$

as required. ■

Although this gives a strongly 2-universal hash family, the restriction that the universe
$U$ and the range $V$ be the same makes the result almost useless; usually we want to
hash a large universe into a much smaller range. We can extend the construction in a
natural way that allows much larger universes. Let $V = \{0, 1, 2, \ldots, p-1\}$, but now
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Although we have described both the 2-universal and the strongly 2-universal hash fam¬ 
ilies in terms of arithmetic modulo a prime number, we could extend these techniques 
to work over general finite fields, in particular fields with $2^n$ elements represented
by sequences of $n$ bits. The extension requires knowledge of finite fields, so we just
sketch the result here. The setup and proof are exactly the same as for Lemma 13.8
except that, instead of working modulo $p$, we perform all arithmetic in a fixed finite field
with $2^n$ elements. We assume a fixed one-to-one mapping $f$ from strings of $n$ bits,
which can also be thought of as numbers in $\{0, 1, \ldots, 2^n - 1\}$, to field elements. We let

$$h_{\bar{a},b}(\bar{u}) = f^{-1}\left(\sum_{i} f(a_i) \cdot f(u_i) + f(b)\right),$$

where the $a_i$ and $b$ are chosen independently and uniformly over $\{0, 1, \ldots, 2^n - 1\}$ and
where the addition and multiplication are performed over the field. This gives a strongly
2-universal hash function with a range of size $2^n$.


13.3.3. Application: Perfect Hashing

Perfect hashing is an efficient data structure for storing a static dictionary. In a static 
dictionary, items are permanently stored in a table. Once the items are stored, the table 
is used only for search operations: a search for an item gives the location of the item 
in the table or returns that the item is not in the table. 

Suppose that a set S of m items is hashed into a table of n bins, using a hash func¬ 
tion from a 2-universal family and chain hashing. In chain hashing (see Section 5.5.1), 
items hashed to the same bin are kept in a linked list. The number of operations for 
looking up an item x is proportional to the number of items in x’s bin. We have the 
following simple bound. 


Lemma 13.9: Assume that $m$ elements are hashed into an $n$-bin chain hashing table
by using a hash function $h$ chosen uniformly at random from a 2-universal family. For
an arbitrary element $x$, let $X$ be the number of items at the bin $h(x)$. Then

$$\mathbf{E}[X] \le \begin{cases} m/n & \text{if } x \notin S, \\ 1 + (m-1)/n & \text{if } x \in S. \end{cases}$$



Proof: Let $X_i = 1$ if the $i$th element of $S$ (under some arbitrary ordering) is in the
same bin as $x$ and 0 otherwise. Because the hash function is chosen from a 2-universal
family, it follows that

$$\Pr(X_i = 1) \le 1/n.$$

Then the first result follows from

$$\mathbf{E}[X] = \mathbf{E}\left[\sum_{i=1}^{m} X_i\right] = \sum_{i=1}^{m} \mathbf{E}[X_i] \le \frac{m}{n},$$

where we have used the universality of the hash function to conclude that $\mathbf{E}[X_i] \le 1/n$.
Similarly, if $x$ is an element of $S$ then (without loss of generality) let it be the first
element of $S$. Hence $X_1 = 1$, and again

$$\Pr(X_i = 1) \le 1/n$$

when $i \ne 1$. Therefore,

$$\mathbf{E}[X] = \mathbf{E}\left[\sum_{i=1}^{m} X_i\right] = 1 + \sum_{i=2}^{m} \mathbf{E}[X_i] \le 1 + \frac{m-1}{n}. \qquad ■$$

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Lemma 13.9 shows that the average performance of hashing when using a hash func¬ 
tion from a 2-universal family is good, since the time to look through a bin of any 
item is bounded by a small number. For instance, if $m = n$ then, when searching the
hash table for $x$, the expected number of items other than $x$ that must be examined is
at most 1. However, this does not give us a bound on the worst-case time of a lookup.
Some bin may contain $\sqrt{n}$ elements or more, and a search for one of these elements
requires a much longer lookup time.

This motivates the idea of perfect hashing. Given a set $S$, we would like to construct
a hash table that gives excellent worst-case performance. Specifically, by perfect hashing
we mean that only a constant number of operations are required to find an item in
a hash table (or to determine that it isn't there).

We first show that perfect hashing is easy if we are given sufficient space for the 
hash table and a suitable 2-universal family of hash functions. 


Lemma 13.10: If $h \in \mathcal{H}$ is chosen uniformly at random from a 2-universal family of
hash functions mapping the universe $U$ to $[0, n-1]$ then, for any set $S \subset U$ of size $m$,
the probability of $h$ being perfect is at least 1/2 when $n \ge m^2$.

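Proof: Let $s_1, s_2, \ldots, s_m$ be the $m$ items of $S$. Let $X_{ij}$ be 1 if $h(s_i) = h(s_j)$ and
0 otherwise. Let $X = \sum_{1 \le i < j \le m} X_{ij}$. Then, as we saw in Eqn. (13.1), the expected
number of collisions when using a 2-universal hash function is

$$\mathbf{E}[X] = \mathbf{E}\left[\sum_{1 \le i < j \le m} X_{ij}\right] = \sum_{1 \le i < j \le m} \mathbf{E}[X_{ij}] \le \binom{m}{2}\frac{1}{n} < \frac{m^2}{2n}.$$

Markov's inequality then yields

$$\Pr\left(X \ge \frac{m^2}{n}\right) \le \Pr(X \ge 2\mathbf{E}[X]) \le \frac{1}{2}.$$

Hence, when $n \ge m^2$, we find $X < 1$ with probability at least 1/2. This implies that a
randomly chosen hash function is perfect with probability at least 1/2. ■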

To find a perfect hash function when $n \ge m^2$, we may simply try hash functions chosen
uniformly at random from the 2-universal family until we find one with no collisions.
This gives a Las Vegas algorithm. On average we need to try at most two hash functions.

We would like to have perfect hashing without requiring space for $\Omega(m^2)$ bins to
store the set of $m$ items. We can use a two-level scheme that accomplishes perfect
hashing using only $O(m)$ bins. First, we hash the set into a hash table with $m$ bins using
a hash function from a 2-universal family. Some of these bins will have collisions.
For each such bin, we provide a second hash function from an appropriate 2-universal
family and an entirely separate second hash table. If the bin has $k > 1$ items in it then
we use $k^2$ bins in the secondary hash table. We have already shown in Lemma 13.10
that with $k^2$ bins we can find a hash function from a 2-universal family that will give
no collisions. It remains to show that, by carefully choosing the first hash function, we
can guarantee that the total space used by the algorithm is only $O(m)$.


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Theorem 13.11: The two-level approach gives a perfect hashing scheme for m items 
using O(m) bins. 


Proof: As we showed in Lemma 13.10, the number of collisions $X$ in the first stage
satisfies

$$\Pr\left(X \ge \frac{m^2}{n}\right) \le \Pr(X \ge 2\mathbf{E}[X]) \le \frac{1}{2}.$$

When $n = m$, this implies that the probability of having more than $m$ collisions is at
most 1/2. Using the probabilistic method, there exists a choice of hash function from
the 2-universal family in the first stage that gives at most $m$ collisions. In fact, such
a hash function can be found efficiently by trying hash functions chosen uniformly
at random from the 2-universal family, giving a Las Vegas algorithm. We may therefore
assume that we have found a hash function for the first stage that gives at most $m$
collisions.

Let $c_i$ be the number of items in the $i$th bin. Then there are $\binom{c_i}{2}$ collisions between
items in the $i$th bin, so

$$\sum_{i=1}^{m} \binom{c_i}{2} \le m.$$

For each bin with $c_i > 1$ items, we find a second hash function that gives no collisions
using space $c_i^2$. Again, for each bin, this hash function can be found using a Las Vegas
algorithm. The total number of bins used is then bounded above by

$$m + \sum_{i=1}^{m} c_i^2 \le m + 2\sum_{i=1}^{m} \binom{c_i}{2} + \sum_{i=1}^{m} c_i \le m + 2m + m = 4m.$$

Hence, the total number of bins used is only $O(m)$. ■
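
The two-level scheme can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: keys are taken to be distinct non-negative integers below the prime `P` (chosen here as $2^{31} - 1$), hash functions come from the family of Lemma 13.6, every bin with $c_i$ items (including $c_i = 1$) gets a secondary table of $c_i^2$ slots, and the names `random_hash`, `build`, and `lookup` are hypothetical.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; assumes all keys are ints in [0, P)


def random_hash(n, rng):
    """Draw h(x) = ((a*x + b) mod P) mod n from the family of Lemma 13.6."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % n


def build(items, rng=None):
    """Build a two-level table for a non-empty list of distinct keys."""
    rng = rng or random.Random()
    m = len(items)
    # First level: retry until there are at most m collisions
    # (each attempt succeeds with probability at least 1/2).
    while True:
        h = random_hash(m, rng)
        buckets = [[] for _ in range(m)]
        for x in items:
            buckets[h(x)].append(x)
        if sum(len(bk) * (len(bk) - 1) // 2 for bk in buckets) <= m:
            break
    # Second level: a collision-free table of c^2 slots for a bin with c items.
    tables = []
    for bk in buckets:
        size = len(bk) ** 2
        while True:
            g = random_hash(size, rng) if size else None
            slots = [None] * size
            placed = True
            for x in bk:
                j = g(x)
                if slots[j] is not None:  # collision: draw a new function
                    placed = False
                    break
                slots[j] = x
            if placed:
                break
        tables.append((g, slots))
    return h, tables


def lookup(structure, x):
    """Return True if x is stored; evaluates at most two hash functions."""
    h, tables = structure
    g, slots = tables[h(x)]
    return bool(slots) and slots[g(x)] == x
```

Both retry loops are Las Vegas algorithms in the sense used above: the running time is random, but the structure that is returned never has a collision, and the total number of slots is at most $4m$ by the calculation in the proof.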


13.4. Application: Finding Heavy Hitters in Data Streams 


A router forwards packets through a network. At the end of the day, a natural question
for a network administrator to ask is whether the number of bytes traveling from
a source $s$ to a destination $d$ that have passed through the router is larger than a
predetermined threshold value. We call such a source-destination pair a heavy hitter.

When designing an algorithm for finding heavy hitters, we must keep in mind the
restrictions of the router. Routers have very little memory and so cannot keep a count
for each possible pair $s$ and $d$, since there are simply too many such pairs. Also, routers
must forward packets quickly, so the router must perform only a small number of
computational operations for each packet. We present a randomized data structure that is
appropriate even with these limitations. The data structure requires a threshold $q$; all
source-destination pairs that are responsible for at least $q$ total bytes are considered
heavy hitters. Usually $q$ is some fixed percentage, such as 1%, of the total expected
daily traffic. At the end of the day, the data structure gives a list of possible heavy
hitters. All true heavy hitters (responsible for at least $q$ bytes) are listed, but some other
pairs may also appear in the list. Two other input constants, $\epsilon$ and $\delta$, are used to control
what extraneous pairs might be put in the list of heavy hitters. Suppose that $Q$
represents the total number of bytes over the course of the day. Our data structure has the
guarantee that any source-destination pair that constitutes less than $q - \epsilon Q$ bytes of
traffic is listed with probability at most $\delta$. In other words, all heavy hitters are listed;
all pairs that are sufficiently far from being a heavy hitter are listed with probability at
most $\delta$; pairs that are close to heavy hitters may or may not be listed.

This router example is typical of many situations where one wants to keep a succinct
summary of a large data stream. In most data stream models, large amounts of
data arrive sequentially in small blocks, and each block must be processed before the
next block arrives. In the setting of network routers, each block is generally a packet.
The amount of data being handled is often so large and the time between arrivals is so
small that algorithms and data structures that use only a small amount of memory and
computation per block are required.

We can use a variation of a Bloom filter, discussed in Section 5.5.3, to solve this
problem. Unlike our solution there, which assumed the availability of completely random
hash functions, here we obtain strong, provable bounds using only a family of
2-universal hash functions. This is important, because efficiency in the router setting
demands the use of only very simple hash functions that are easy to compute, yet at the
same time we want provable performance guarantees.

We refer to our data structure as a count-min filter. The count-min filter processes
a sequential stream of pairs $X_1, X_2, \ldots$ of the form $X_t = (i_t, c_t)$, where $i_t$ is an item
and $c_t > 0$ is an integer count increment. In our routing setting, $i_t$ would be the pair
of source-destination addresses of a packet and $c_t$ would be the number of bytes in the
packet. Let

$$\mathrm{Count}(i, T) = \sum_{t:\, i_t = i,\ 1 \le t \le T} c_t.$$

That is, $\mathrm{Count}(i, T)$ is the total count associated with an item $i$ up to time $T$. In the
routing setting, $\mathrm{Count}(i, T)$ would be the total number of bytes associated with packets
with an address pair $i$ up to time $T$. The count-min filter keeps a running approximation
of $\mathrm{Count}(i, T)$ for all items $i$ and all times $T$ in such a way that it can track heavy
hitters.

A count-min filter consists of $m$ counters. We assume henceforth that our counters
have sufficiently many bits that we do not need to worry about overflow; in many practical
situations, 32-bit counters will suffice and are convenient for implementation. A
count-min filter uses $k$ hash functions. We split the counters into $k$ disjoint groups
$G_1, G_2, \ldots, G_k$ of size $m/k$. For convenience, we assume in what follows that $k$
divides $m$ evenly. We label the counters by $C_{a,j}$, where $1 \le a \le k$ and $0 \le j \le m/k - 1$,
so that $C_{a,j}$ corresponds to the $j$th counter in the $a$th group. That is, we can think of
our counters as being organized in a 2-dimensional array, with $m/k$ counters per row
and $k$ rows. Our hash functions should map items from the universe into counters,
so we have hash functions $H_a$ for $1 \le a \le k$, where $H_a : U \to [0, m/k - 1]$. That is,
each of the $k$ hash functions takes an item from the universe and maps it into a number
in $[0, m/k - 1]$. Equivalently, we can think of each hash function as taking an item
$i$ and mapping it to the counter $C_{a,H_a(i)}$. The $H_a$ should be chosen independently and
uniformly at random from a 2-universal hash family.

We use our counters to keep track of an approximation of $\mathrm{Count}(i, T)$. Initially, all
the counters are set to 0. To process a pair $(i_t, c_t)$, we compute $H_a(i_t)$ for each $a$ with
$1 \le a \le k$ and increment $C_{a,H_a(i_t)}$ by $c_t$. Let $C_{a,j}(T)$ be the value of the counter $C_{a,j}$
after processing $X_1$ through $X_T$. We claim that, for any item, the smallest counter
associated with that item is an upper bound on its count, and with bounded probability
the smallest counter associated with that item is off by no more than $\epsilon$ times the total
count of all the pairs $(i_t, c_t)$ processed up to that point. Specifically, we have the
following theorem.

Theorem 13.12: For any $i$ in the universe $U$ and for any sequence $(i_1, c_1), \ldots, (i_T, c_T)$,

$$\min_{j = H_a(i),\ 1 \le a \le k} C_{a,j}(T) \ge \mathrm{Count}(i, T).$$

Furthermore, with probability $1 - (k/m\epsilon)^k$ over the choice of hash functions,

$$\min_{j = H_a(i),\ 1 \le a \le k} C_{a,j}(T) \le \mathrm{Count}(i, T) + \epsilon \sum_{t=1}^{T} c_t.$$


Proof: The first bound,

$$\min_{j = H_a(i),\ 1 \le a \le k} C_{a,j}(T) \ge \mathrm{Count}(i, T),$$

is trivial. Each counter $C_{a,j}$ with $j = H_a(i)$ is incremented by $c_t$ when the pair $(i, c_t)$ is
seen in the stream. It follows that the value of each such counter is at least $\mathrm{Count}(i, T)$
at any time $T$.

For the second bound, consider any specific $i$ and $T$. We first consider the specific
counter $C_{1,H_1(i)}$ and then use symmetry. We know that the value of this counter is at
least $\mathrm{Count}(i, T)$ after the first $T$ pairs. Let the random variable $Z_1$ be the amount the
counter is incremented owing to items other than $i$. Let $X_t$ be a random variable that is
1 if $i_t \ne i$ and $H_1(i_t) = H_1(i)$; $X_t$ is 0 otherwise. Then

$$Z_1 = \sum_{t:\, 1 \le t \le T,\ i_t \ne i,\ H_1(i_t) = H_1(i)} c_t = \sum_{t=1}^{T} X_t c_t.$$

Because $H_1$ is chosen from a 2-universal family, for any $i_t \ne i$ we have

$$\Pr(H_1(i_t) = H_1(i)) \le \frac{k}{m}$$

and hence

$$\mathbf{E}[X_t] \le \frac{k}{m}.$$

It follows that

$$\mathbf{E}[Z_1] = \mathbf{E}\left[\sum_{t=1}^{T} X_t c_t\right] = \sum_{t=1}^{T} c_t \mathbf{E}[X_t] \le \frac{k}{m}\sum_{t=1}^{T} c_t.$$

By Markov's inequality,

$$\Pr\left(Z_1 \ge \epsilon \sum_{t=1}^{T} c_t\right) \le \frac{k/m}{\epsilon} = \frac{k}{m\epsilon}. \qquad (13.2)$$

Let $Z_2, Z_3, \ldots, Z_k$ be corresponding random variables for each of the other hash functions.
By symmetry, all of the $Z_i$ satisfy the probabilistic bound of Eqn. (13.2). Moreover,
the $Z_i$ are independent, since the hash functions are chosen independently from
the family of hash functions. Hence

$$\Pr\left(\min_{1 \le j \le k} Z_j \ge \epsilon \sum_{t=1}^{T} c_t\right) = \prod_{j=1}^{k} \Pr\left(Z_j \ge \epsilon \sum_{t=1}^{T} c_t\right) \qquad (13.3)$$

$$\le \left(\frac{k}{m\epsilon}\right)^{k}. \qquad (13.4)$$

■


It is easy to check using calculus that $(k/m\varepsilon)^k$ is minimized when $k = m\varepsilon/e$, in which
case

$$\left(\frac{k}{m\varepsilon}\right)^{k} = e^{-m\varepsilon/e}.$$

Of course, $k$ needs to be chosen so that $k$ and $m/k$ are integers, but this does not
substantially affect the probability bounds.
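
The calculus behind the choice $k = m\varepsilon/e$ can be spelled out as follows. This is only a sketch: $k$ is treated as a continuous variable, and the name $f$ is introduced here purely for illustration.

```latex
% Minimizing (k/(m\varepsilon))^k is the same as minimizing its logarithm.
f(k) = k \ln\frac{k}{m\varepsilon},
\qquad
f'(k) = \ln\frac{k}{m\varepsilon} + 1 .
% f'(k) = 0 gives k/(m\varepsilon) = 1/e, that is, k = m\varepsilon/e.
% Since f''(k) = 1/k > 0, this critical point is a minimum.
f\!\left(\frac{m\varepsilon}{e}\right)
  = \frac{m\varepsilon}{e}\,\ln\frac{1}{e}
  = -\frac{m\varepsilon}{e}
\quad\Longrightarrow\quad
\left(\frac{k}{m\varepsilon}\right)^{k} = e^{-m\varepsilon/e}.
```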

We can use a count-min filter to track heavy hitters in the routing setting as follows.
When a pair $(i_t, c_t)$ arrives, we update the count-min filter. If the minimum hash
value associated with $i_t$ is at least the threshold $q$ for heavy hitters, then we put the
item into a list of potential heavy hitters. We do not concern ourselves with the details
of performing operations on this list, but note that it can be organized to allow updates
and searches in time logarithmic in its size by using standard balanced search-tree data
structures; alternatively, it could be organized in a large array or a hash table.

Recall that we use Q to represent the total traffic at the end of the day. 

Corollary 13.13: Suppose that we use a count-min filter with $k = \lceil \ln(1/\delta) \rceil$ hash functions,
$m = \lceil \ln(1/\delta) \rceil \cdot \lceil e/\varepsilon \rceil$ counters, and a threshold $q$. Then all heavy hitters are put on
the list, and any source-destination pair that corresponds to fewer than $q - \varepsilon Q$ bytes
is put on the list with probability at most $\delta$.

Proof: Since counts increase over time, we can simply consider the situation at the end
of the day. By Theorem 13.12, the count-min filter will ensure that all true heavy hitters
are put on the list, since the smallest counter value for a true heavy hitter will be at least
$q$. Further, by Theorem 13.12, the smallest counter value for any source-destination
pair that corresponds to fewer than $q - \varepsilon Q$ bytes reaches $q$ with probability at most

$$\left(\frac{k}{m\varepsilon}\right)^{k} \le e^{-\ln(1/\delta)} = \delta,$$

since our choice of $k$ and $m$ gives $k/m\varepsilon \le 1/e$ and $k \ge \ln(1/\delta)$. ■

Figure 13.1: An item comes in, and 3 is to be added to the count. The initial state is on the left; the
shaded counters need to be updated. Using conservative update, the minimum counter value 4
determines that all corresponding counters need to be pushed up to at least 4 + 3 = 7. The resulting state
after the update is shown on the right.

The count-min filter is very efficient in terms of using only limited randomness in its
hash functions, only $O\left(\frac{1}{\varepsilon} \ln \frac{1}{\delta}\right)$ counters, and only $O\left(\ln \frac{1}{\delta}\right)$ computations to process
each item. (Additional computation and space might be required to handle the list of
potential heavy hitters, depending on its representation.)
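
The filter and the heavy-hitter list described above can be sketched in a few lines of Python. This is a minimal illustration rather than a tuned implementation, and it rests on several assumptions of ours: items are nonnegative integers smaller than the prime `PRIME`, each $H_a$ has the form $((\alpha x + \beta) \bmod p) \bmod (m/k)$, the list of potential heavy hitters is a plain Python set, and all class and function names are hypothetical.

```python
import math
import random


class CountMinFilter:
    """Count-min filter: k groups of m/k counters, one hash function per group."""

    PRIME = (1 << 61) - 1  # assumption: larger than every item identifier

    def __init__(self, k, counters_per_group, seed=None):
        rng = random.Random(seed)
        self.k = k
        self.width = counters_per_group  # this is m/k in the text
        self.table = [[0] * self.width for _ in range(k)]
        # One independently chosen pair (alpha, beta) per group defines H_a.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(k)]

    def _cells(self, item):
        """Return the (group, index) position of every counter the item hashes to."""
        return [(a, ((alpha * item + beta) % self.PRIME) % self.width)
                for a, (alpha, beta) in enumerate(self.hashes)]

    def update(self, item, count):
        """Process one pair (i_t, c_t): add c_t to each of the k counters."""
        for a, j in self._cells(item):
            self.table[a][j] += count

    def estimate(self, item):
        """Smallest associated counter; never below the true count."""
        return min(self.table[a][j] for a, j in self._cells(item))


def corollary_parameters(eps, delta):
    """Return (k, m/k) as chosen in Corollary 13.13, for 0 < delta < 1."""
    k = math.ceil(math.log(1 / delta))
    return k, math.ceil(math.e / eps)


def potential_heavy_hitters(stream, q, eps, delta, seed=None):
    """Return the items whose minimum counter reached the threshold q."""
    k, width = corollary_parameters(eps, delta)
    cm = CountMinFilter(k, width, seed)
    candidates = set()
    for item, count in stream:
        cm.update(item, count)
        if cm.estimate(item) >= q:
            candidates.add(item)
    return candidates
```

Because every counter an item hashes to receives all of that item's increments, `estimate` can only overcount, so a true heavy hitter is always added to `candidates`; the randomness affects only how many light items slip in.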

Before ending our discussion of the count-min filter, we describe a simple improvement
known as conservative update that often works well in practice, although it is
difficult to analyze. When a pair $(i_t, c_t)$ arrives, our original count-min filter adds $c_t$ to
each counter $C_{a,j}$ that the item $i_t$ hashes to, thereby guaranteeing that

$$\min_{j = H_a(i),\; 1 \le a \le k} C_{a,j}(T) \ge \mathrm{Count}(i, T)$$

holds for all $i$ and $T$. In fact, this can often be guaranteed without adding $c_t$ to each
counter. Consider the state after the $(t-1)$th pair has been processed. Suppose that,
inductively, up to that point we have, for all $i$,

$$\min_{j = H_a(i),\; 1 \le a \le k} C_{a,j}(t-1) \ge \mathrm{Count}(i, t-1).$$

Then, when $(i_t, c_t)$ arrives, we need to ensure that

$$C_{a,j}(t) \ge \mathrm{Count}(i_t, t)$$

for all counters, where $j = H_a(i_t)$, $1 \le a \le k$. But

$$\mathrm{Count}(i_t, t) = \mathrm{Count}(i_t, t-1) + c_t \le \min_{j = H_a(i_t),\; 1 \le a \le k} C_{a,j}(t-1) + c_t.$$

Hence we can look at the minimum counter value $v$ obtained from the $k$ counters that
$i_t$ hashes to, add $c_t$ to that value, and increase to $v + c_t$ any counter that is smaller than
$v + c_t$. An example is given in Figure 13.1. An item arrives with a count of 3; at the
time of arrival, the smallest counter associated with the item has value 4. It follows that
the count for this item is at most 7, so we can update all associated counters to ensure
they are all at least 7. In general, if all the counters $i_t$ hashes to are equal, conservative
update is equivalent to just adding $c_t$ to each counter. When the counters are not all equal, the
conservative update improvement adds less to some of the counters, which will tend to
reduce the errors that the filter produces.
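
The difference between the two update rules is easy to see in code. The sketch below is ours: it represents the filter as a list of counter rows and takes the $k$ counter positions for the arriving item as given. The counter values in the usage note are hypothetical and are not those of Figure 13.1, although they share its minimum of 4 and increment of 3.

```python
def plain_update(counters, cells, c):
    """Original rule: add c to every counter the item hashes to."""
    for a, j in cells:
        counters[a][j] += c


def conservative_update(counters, cells, c):
    """Conservative rule: raise each counter only as far as v + c,
    where v is the smallest counter the item hashes to."""
    target = min(counters[a][j] for a, j in cells) + c
    for a, j in cells:
        if counters[a][j] < target:
            counters[a][j] = target
```

For instance, if the three counters the item hashes to hold 4, 8, and 7 and the increment is 3, the plain rule leaves them at 7, 11, and 10, whereas the conservative rule leaves them at 7, 8, and 7.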
13.5. Exercises

Exercise 13.1: A fair coin is flipped $n$ times. Let $X_{ij}$, with $1 \le i < j \le n$, be 1 if the
$i$th and $j$th flip landed on the same side; let $X_{ij} = 0$ otherwise. Show that the $X_{ij}$ are
pairwise independent but not independent.

Exercise 13.2: (a) Let $X$ and $Y$ be numbers that are chosen independently and uniformly
at random from $\{0, 1, \ldots, n\}$. Let $Z$ be their sum modulo $n + 1$. Show that $X$, $Y$,
and $Z$ are pairwise independent but not independent.

(b) Extend this example to give a collection of random variables that are $k$-wise
independent but not $(k + 1)$-wise independent.

Exercise 13.3: For any family of hash functions from a finite set $U$ to a finite set $V$,
show that, when $h$ is chosen at random from that family of hash functions, there exists
a pair of elements $x$ and $y$ such that

$$\Pr(h(x) = h(y)) \ge \frac{1}{|V|} - \frac{1}{|U|}.$$

This result should not depend on how the function $h$ is chosen from the family.

Exercise 13.4: Show that, for any discrete random variable $X$ that takes on values in
the range $[0, 1]$, $\mathrm{Var}[X] \le 1/4$.

Exercise 13.5: Suppose we have a randomized algorithm Test for testing whether a
string appears in a language $L$ that works as follows. Given an input $x$, the algorithm
Test chooses a random integer $r$ uniformly from the set $S = \{0, 1, \ldots, p - 1\}$ for some
prime $p$. If $x$ is in the language, then $\mathrm{Test}(x, r) = 1$ for at least half of the possible values
of $r$. A value of $r$ such that $\mathrm{Test}(x, r) = 1$ is called a witness for $x$. If $x$ is not in the
language, then $\mathrm{Test}(x, r) = 0$ always.

If we run the algorithm Test twice on an input $x \in L$ by choosing two numbers $r_1$
and $r_2$ independently and uniformly from $S$ and evaluating $\mathrm{Test}(x, r_1)$ and $\mathrm{Test}(x, r_2)$,
then we find a witness with probability at least 3/4. Argue that we can obtain a witness
with probability at least $1 - 1/t$ using the same amount of randomness by letting
$s_i = r_1 i + r_2 \bmod p$ and evaluating $\mathrm{Test}(x, s_i)$ for values $0 \le i \le t < p$.

Exercise 13.6: Our analysis of Bucket sort in Section 5.2.2 assumed that $n$ elements
were chosen independently and uniformly at random from the range $[0, 2^k)$. Suppose
instead that $n$ elements are chosen uniformly from the range $[0, 2^k)$ in such a way that
they are only pairwise independent. Show that, under these conditions, Bucket sort
still requires linear expected time.

Exercise 13.7: (a) We have shown that the maximum load when $n$ items are hashed
into $n$ bins using a hash function chosen from a 2-universal family of hash functions
is at most $\sqrt{2n}$ with probability at least 1/2. Generalize this argument to $k$-universal
hash functions. That is, find a value such that the probability that the maximum load is
larger than that value is at most 1/2.

(b) In Lemma 5.1 we showed that, under the standard balls-and-bins model, the
maximum load when $n$ balls are thrown independently and uniformly at random into
$n$ bins is at most $3\ln n/\ln\ln n$ with probability $1 - 1/n$. Find the smallest value of $k$
such that the maximum load is at most $3\ln n/\ln\ln n$ with probability at least 1/2 when
choosing a hash function from a $k$-universal family.


Exercise 13.8: We can generalize the problem of finding a large cut to finding a large 
k-cut. A k-cut is a partition of the vertices into k disjoint sets, and the value of a cut is 
the weight of all edges crossing from one of the k sets to another. In Section 13.1.2 we 
considered 2-cuts when all edges had the same weight 1, and we showed how to derandomize
the standard randomized algorithm using collections of $n$ pairwise independent
bits. Explain how this derandomization could be generalized to obtain a polynomial 
time algorithm for 3-cuts, and give the running time for your algorithm. (Hint: You 
may want to use a hash function of the type found in Section 13.3.2.) 


Exercise 13.9: Suppose we are given $m$ vectors $\bar{v}_1, \bar{v}_2, \ldots, \bar{v}_m \in \{0, 1\}^{\ell}$ such that any $k$
of the $m$ vectors are linearly independent modulo 2. Let $\bar{v}_i = (v_{i,1}, v_{i,2}, \ldots, v_{i,\ell})$. Let
$\bar{u}$ be chosen uniformly at random from $\{0, 1\}^{\ell}$, and let $X_i = \sum_{j=1}^{\ell} v_{i,j} u_j \bmod 2$. Show
that the $X_i$ are uniform, $k$-wise independent bits.

Exercise 13.10: We examine a specific way in which 2-universal hash functions differ
from completely random hash functions. Let $S = \{0, 1, 2, \ldots, k\}$, and consider a
hash function $h$ with range $\{0, 1, 2, \ldots, p - 1\}$ for some prime $p$ much larger than $k$.

Consider the values $h(0), h(1), \ldots, h(k)$. If $h$ is a completely random hash function,
then the probability that $h(0)$ is smaller than any of the other values is roughly $1/(k + 1)$.
(There may be a tie for the smallest value, so the probability that any $h(i)$ is the unique
smallest value is slightly less than $1/(k + 1)$.) Now consider a hash function $h$ chosen
uniformly from the family

$$\mathcal{H} = \{h_{a,b} \mid 0 \le a, b \le p - 1\}$$

of Section 13.3.2. Estimate the probability that $h(0)$ is smaller than $h(1), \ldots, h(k)$ by
randomly choosing 10,000 hash functions from $\mathcal{H}$ and computing $h(x)$ for all $x \in S$.
Run this experiment for $k = 32$ and $k = 128$, using primes $p = 5{,}023{,}309$ and
$p = 10{,}570{,}849$. Is your estimate close to $1/(k + 1)$?


Exercise 13.11: In a multi-set, each element can appear multiple times. Suppose that
we have two multi-sets, $S_1$ and $S_2$, consisting of positive integers. We want to test if
the two sets are the "same" (that is, if each item appears the same number of times in
each set). One way of doing this is to sort both sets and then compare the sets in sorted
order. This takes $O(n \log n)$ time if each multi-set contains $n$ elements.

(a) Consider the following algorithm. Hash each element of $S_1$ into a hash table with
$cn$ counters; the counters are initially 0, and the $i$th counter is incremented each
time the hash value of an element is $i$. Using another table of the same size and
using the same hash function, do the same for $S_2$. If the $i$th counter in the first
table matches the $i$th counter in the second table for all $i$, report that the sets are
the same, and otherwise report that the sets are different.

Analyze the running time and error probability of this algorithm, assuming that 
the hash function is chosen from a 2-universal family. Explain how this algorithm 
can be extended to a Monte Carlo algorithm, and analyze the trade-off between its 
running time and its error probability. 

(b) We can also design a Las Vegas algorithm for this problem. Now each entry in the
hash table corresponds to a linked list of counters. Each entry holds a list of the
number of occurrences of each element that hashes to that location; this list can be
kept in sorted order. Again, we create a hash table for $S_1$ and a hash table for $S_2$,
and we test after hashing if the resulting tables are equal.

Argue that this algorithm requires only linear expected time using only linear 
space. 

Exercise 13.12: In Section 13.3.1 we showed that the family

$$\mathcal{H} = \{h_{a,b} \mid 1 \le a \le p - 1,\ 0 \le b \le p\}$$

is 2-universal when $p \ge n$, where

$$h_{a,b}(x) = ((ax + b) \bmod p) \bmod n.$$

Consider now the hash functions

$$h_a(x) = (ax \bmod p) \bmod n$$

and the family

$$\mathcal{H}' = \{h_a \mid 1 \le a \le p - 1\}.$$

Give an example to show that $\mathcal{H}'$ is not 2-universal. Then prove that $\mathcal{H}'$ is almost
2-universal in the following sense: for any $x, y \in \{0, 1, 2, \ldots, p - 1\}$, if $h$ is chosen
uniformly at random from $\mathcal{H}'$ then

$$\Pr(h(x) = h(y)) \le \frac{2}{n}.$$

Exercise 13.13: In describing count-min filters, we assumed that the data stream
consisted of pairs of the form $(i_t, c_t)$, where $i_t$ was an item and $c_t > 0$ an integer count
increment. Suppose that one were also allowed to decrement the count for an item, so
that the stream could include pairs of the form $(i_t, c_t)$ with $c_t < 0$. We could require
that the total count for an item $i$,

$$\mathrm{Count}(i, T) = \sum_{t \,:\, i_t = i,\; 1 \le t \le T} c_t,$$

always be positive.

Explain how you could modify or otherwise use count-min filters to find heavy hitters
in this situation.
CHAPTER FOURTEEN*

Balanced Allocations 


In this chapter we examine a simple yet powerful variant of the classic balls-and-bins 
paradigm, with applications to hashing and dynamic resource allocation. 


14.1. The Power of Two Choices 


Suppose that we sequentially place $n$ balls into $n$ bins by putting each ball into a bin
chosen independently and uniformly at random. We studied this classic balls-and-bins
problem in Chapter 5. There we showed that, at the end of the process, the most balls
in any bin (the maximum load) is $\Theta(\ln n/\ln\ln n)$ with high probability.

In a variant of the process, each ball comes with $d$ possible destination bins, each
chosen independently and uniformly at random, and is placed in the least full bin
among the $d$ possible locations at the time of the placement. The original balls-and-bins
process corresponds to the case where $d = 1$. Surprisingly, even when $d = 2$, the
behavior is completely different: when the process terminates, the maximum load is
$\ln\ln n/\ln 2 + O(1)$ with high probability. Thus, an apparently minor change in the random
allocation process results in an exponential decrease in the maximum load. We
may then ask what happens if each ball has three choices; perhaps the resulting load is
then $O(\ln\ln\ln n)$. We shall consider the general case of $d$ choices per ball and show
that, when $d \ge 2$, with high probability the maximum load is $\ln\ln n/\ln d + \Theta(1)$.
Although having more than two choices does reduce the maximum load, the reduction
changes it by only a constant factor, so it remains $\Theta(\ln\ln n)$.
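
The gap between one choice and two is easy to observe empirically. The following short simulation is an illustrative sketch of the process just described; the function name and the use of Python's `random` module are our own choices.

```python
import random


def max_load(n, d, seed=None):
    """Place n balls into n bins; each ball picks d bins uniformly at random
    (with replacement) and goes to the least full one, ties broken randomly."""
    rng = random.Random(seed)
    load = [0] * n
    for _ in range(n):
        choices = [rng.randrange(n) for _ in range(d)]
        least = min(load[b] for b in choices)
        target = rng.choice([b for b in choices if load[b] == least])
        load[target] += 1
    return max(load)
```

Calling `max_load(n, 1)` and `max_load(n, 2)` for increasing `n` shows the first value creeping upward while the second barely moves, in line with the $\Theta(\ln n/\ln\ln n)$ and $\ln\ln n/\ln 2 + O(1)$ behavior.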

14.1.1. The Upper Bound 


Theorem 14.1: Suppose that $n$ balls are sequentially placed into $n$ bins in the following
manner. For each ball, $d \ge 2$ bins are chosen independently and uniformly at
random (with replacement). Each ball is placed in the least full of the $d$ bins at the
time of the placement, with ties broken randomly. After all the balls are placed, the
maximum load of any bin is at most $\ln\ln n/\ln d + O(1)$ with probability $1 - o(1/n)$.
The proof is rather technical, so before beginning we informally sketch the main points.
In order to bound the maximum load, we need to approximately bound the number of
bins with $i$ balls for all values of $i$. In fact, for any given $i$, instead of trying to bound
the number of bins with load exactly $i$, it will be easier to bound the number of bins
with load at least $i$. The argument proceeds via what is, for the most part, a straightforward
induction. We wish to find a sequence of values $\beta_i$ such that the number of
bins with load at least $i$ is bounded above by $\beta_i$ with high probability.

Suppose that we knew that, over the entire course of the process, the number of bins
with load at least $i$ was bounded above by $\beta_i$. Let us consider how we would determine
an appropriate inductive bound for $\beta_{i+1}$ that holds with high probability. Define
the height of a ball to be one more than the number of balls already in the bin in which
the ball is placed. That is, if we think of balls as being stacked in the bin by order of
arrival, the height of a ball is its position in the stack. The number of balls of height at
least $i + 1$ gives an upper bound for the number of bins with at least $i + 1$ balls.

A ball will have height at least $i + 1$ only if each of its $d$ choices for a bin has load
at least $i$. If there are indeed at most $\beta_i$ bins with load at least $i$ at all times, then the
probability that each choice yields a bin with load at least $i$ is at most $\beta_i/n$. Therefore,
the probability that a ball has height at least $i + 1$ is at most $(\beta_i/n)^d$. We can use a
Chernoff bound to conclude that, with high probability, the number of balls of height
at least $i + 1$ will be at most $2n(\beta_i/n)^d$. That is, if everything works as sketched, then

$$\frac{\beta_{i+1}}{n} \le 2\left(\frac{\beta_i}{n}\right)^{d}.$$

We examine this recursion carefully in the analysis and show that $\beta_j$ becomes $O(\ln n)$
when $j = \ln\ln n/\ln d + O(1)$. At this point, we must be a bit more careful in our analysis
because Chernoff bounds will no longer be sufficiently useful, but the result is easy
to finish from there.

The proof is technically challenging primarily because one must handle the conditioning
appropriately. In bounding $\beta_{i+1}$, we assumed that we had a bound on $\beta_i$. This
assumption must be treated as a conditioning in the formal argument, which requires
some care.
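
To get a feel for how quickly a recursion of the form $\beta_{i+1} = 2n(\beta_i/n)^d$ collapses, one can simply iterate it. The sketch below is illustrative only: it assumes a starting value $\beta_4 = n/4$ (there can never be more than $n/4$ bins holding at least 4 balls) and stops once $(\beta_i/n)^d$ drops below $6\ln n/n$, a cutoff we pick as the point where roughly $\ln n$ balls remain. The function name is hypothetical.

```python
import math


def first_small_index(n, d):
    """Iterate beta_{i+1}/n = 2 (beta_i/n)^d from beta_4/n = 1/4 and return
    the first i at which (beta_i/n)^d < 6 ln(n) / n."""
    x = 0.25  # x tracks beta_i / n, starting from i = 4
    i = 4
    threshold = 6 * math.log(n) / n
    while x ** d >= threshold:
        x = 2 * x ** d
        i += 1
    return i
```

Comparing the returned index with `math.log(math.log(n)) / math.log(d)` for a range of `n` shows the two differing by only a small additive constant, which is the doubly logarithmic behavior the sketch predicts.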

We shall use the following notation: the state at time $t$ refers to the state of the system
immediately after the $t$th ball is placed. The variable $h(t)$ denotes the height of
the $t$th ball, and $\nu_i(t)$ and $\mu_i(t)$ refer (respectively) to the number of bins with load at
least $i$ and the number of balls with height at least $i$ at time $t$. We use $\nu_i$ and $\mu_i$ for
$\nu_i(n)$ and $\mu_i(n)$ when the meaning is clear. An obvious but important fact, of which
we make frequent use in the proof, is that $\nu_i(t) \le \mu_i(t)$, since every bin with load at
least $i$ must contain at least one ball with height at least $i$.

Before beginning, we make note of two simple lemmas. First, we utilize a specific
Chernoff bound for binomial random variables, easily derived from Eqn. (4.2) by letting
$\delta = 1$.

Lemma 14.2:

$$\Pr(B(n, p) \ge 2np) \le e^{-np/3}.$$
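
As a sanity check on the bound $\Pr(B(n,p) \ge 2np) \le e^{-np/3}$, the exact binomial tail can be computed for moderate $n$ and compared with the exponential. This is only a numerical illustration under our own choice of parameters; the function names are hypothetical, and the exact sum is practical only for $n$ up to about a thousand.

```python
import math


def binomial_upper_tail(n, p, k):
    """Exact Pr(B(n, p) >= k), summed term by term."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))


def chernoff_gap(n, p):
    """Return (exact tail at 2np, Chernoff bound e^{-np/3})."""
    k = math.ceil(2 * n * p)
    return binomial_upper_tail(n, p, k), math.exp(-n * p / 3)
```

For each pair `(n, p)` tried, the first component of `chernoff_gap(n, p)` should not exceed the second, and the gap widens as $np$ grows.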
The following lemma will help us cope with dependent random variables in the main
proof.

Lemma 14.3: Let $X_1, X_2, \ldots, X_n$ be a sequence of random variables in an arbitrary
domain, and let $Y_1, Y_2, \ldots, Y_n$ be a sequence of binary random variables with the property
that $Y_i = Y_i(X_1, \ldots, X_i)$. If

$$\Pr(Y_i = 1 \mid X_1, \ldots, X_{i-1}) \le p,$$

then

$$\Pr\left(\sum_{i=1}^{n} Y_i > k\right) \le \Pr(B(n, p) > k).$$

Proof: If we consider the $Y_i$ one at a time, then each $Y_i$ is less likely to take on the
value 1 than an independent Bernoulli trial with success probability $p$, regardless of the
values of the $X_i$. The result then follows by a simple induction. ■


We now begin the main proof. 

Proof of Theorem 14.1: Following the earlier sketch, we shall construct values $\beta_i$
such that, with high probability, $\nu_i(n) \le \beta_i$ for all $i$. Let $\beta_4 = n/4$, and let
$\beta_{i+1} = 2\beta_i^d/n^{d-1}$ for $4 \le i < i^*$, where $i^*$ is to be determined. We let $\mathcal{E}_i$ be the event that
$\nu_i(n) \le \beta_i$. Note that $\mathcal{E}_4$ holds with probability 1; there cannot be more than $n/4$ bins
with at least 4 balls when there are only $n$ balls. We now show that, with high probability,
if $\mathcal{E}_i$ holds then $\mathcal{E}_{i+1}$ holds for $4 \le i < i^*$.

Fix a value of $i$ in the given range. Let $Y_t$ be a binary random variable such that

$$Y_t = 1 \ \text{ if and only if } \ h(t) \ge i + 1 \ \text{ and } \ \nu_i(t - 1) \le \beta_i.$$

That is, $Y_t$ is 1 if the height of the $t$th ball is at least $i + 1$ and if, at time $t - 1$, there
are at most $\beta_i$ bins with load at least $i$. The requirement that $Y_t$ be 1 only if there are at
most $\beta_i$ bins with load at least $i$ may seem a bit odd; however, it makes handling the
conditioning much easier.

Specifically, let $\omega_j$ represent the bins selected by the $j$th ball. Then

$$\Pr(Y_t = 1 \mid \omega_1, \ldots, \omega_{t-1}) \le \frac{\beta_i^d}{n^d}.$$

That is, given the choices made by the first $t - 1$ balls, the probability that $Y_t$ is 1 is
bounded by $(\beta_i/n)^d$. This is because, in order for $Y_t$ to be 1, there must be at most $\beta_i$
bins with load at least $i$; and when this condition holds, the $d$ choices of bins for the
$t$th ball all have load at least $i$ with probability at most $(\beta_i/n)^d$. If we did not force $Y_t$ to be 0
if there are more than $\beta_i$ bins with load at least $i$, then we would not be able to bound
this conditional probability in this way.

Let $p_i = \beta_i^d/n^d$. Then, from Lemma 14.3, we can conclude that

$$\Pr\left(\sum_{t=1}^{n} Y_t > k\right) \le \Pr(B(n, p_i) > k).$$

This holds independently of any of the events $\mathcal{E}_i$, owing to our careful definition of $Y_t$.
(Had we not included the condition that $Y_t = 1$ only if $\nu_i(t - 1) \le \beta_i$, the inequality
would not necessarily hold.)

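Conditioned on $\mathcal{E}_i$, we have $\sum_{t=1}^{n} Y_t = \mu_{i+1}$. Since $\nu_{i+1} \le \mu_{i+1}$, we have

$$\begin{aligned}
\Pr(\nu_{i+1} > k \mid \mathcal{E}_i) &\le \Pr(\mu_{i+1} > k \mid \mathcal{E}_i) \\
&= \Pr\left(\sum_{t=1}^{n} Y_t > k \,\Big|\, \mathcal{E}_i\right) \\
&\le \frac{\Pr\left(\sum_{t=1}^{n} Y_t > k\right)}{\Pr(\mathcal{E}_i)} \\
&\le \frac{\Pr(B(n, p_i) > k)}{\Pr(\mathcal{E}_i)}.
\end{aligned}$$

We bound the tail of the binomial distribution by using the Chernoff bound of
Lemma 14.2. Letting $k = \beta_{i+1} = 2np_i$ in the previous equations yields

$$\Pr(\nu_{i+1} > \beta_{i+1} \mid \mathcal{E}_i) \le \frac{\Pr(B(n, p_i) > 2np_i)}{\Pr(\mathcal{E}_i)} \le \frac{1}{e^{p_i n/3} \Pr(\mathcal{E}_i)},$$

which gives

$$\Pr(\neg\mathcal{E}_{i+1} \mid \mathcal{E}_i) \le \frac{1}{n^2 \Pr(\mathcal{E}_i)} \tag{14.2}$$

whenever $p_i n \ge 6 \ln n$.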

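We now remove the conditioning by using the fact that

$$\begin{aligned}
\Pr(\neg\mathcal{E}_{i+1}) &= \Pr(\neg\mathcal{E}_{i+1} \mid \mathcal{E}_i)\Pr(\mathcal{E}_i) + \Pr(\neg\mathcal{E}_{i+1} \mid \neg\mathcal{E}_i)\Pr(\neg\mathcal{E}_i) \\
&\le \Pr(\neg\mathcal{E}_{i+1} \mid \mathcal{E}_i)\Pr(\mathcal{E}_i) + \Pr(\neg\mathcal{E}_i);
\end{aligned} \tag{14.3}$$

then, by Eqns. (14.2) and (14.3),

$$\Pr(\neg\mathcal{E}_{i+1}) \le \Pr(\neg\mathcal{E}_i) + \frac{1}{n^2} \tag{14.4}$$

as long as $p_i n \ge 6 \ln n$.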

Hence, whenever $p_i n \ge 6 \ln n$ and $\mathcal{E}_i$ holds with high probability, then so does $\mathcal{E}_{i+1}$.
To conclude we need two more steps. First, we need to show that $p_i n < 6 \ln n$ when
$i$ is approximately $\ln\ln n/\ln d$, since this is our desired bound on the maximum load.
Second, we must carefully handle the case where $p_i n < 6 \ln n$ separately, since the
Chernoff bound is no longer strong enough to give appropriate bounds once $p_i$ is this
small.

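Let $i^*$ be the smallest value of $i$ such that $p_i = \beta_i^d/n^d < 6 \ln n/n$. We show that $i^*$
is $\ln\ln n/\ln d + O(1)$. To do this, we prove inductively the bound

$$\beta_{i+4} \le \frac{n}{2^{\,2d^i - \sum_{j=0}^{i-1} d^j}}.$$

This holds true when $i = 0$, and the induction argument follows:

$$\begin{aligned}
\beta_{(i+1)+4} &= \frac{2\beta_{i+4}^d}{n^{d-1}} \\
&\le \frac{2}{n^{d-1}} \left(\frac{n}{2^{\,2d^i - \sum_{j=0}^{i-1} d^j}}\right)^{d} \\
&= \frac{n}{2^{\,2d^{i+1} - \sum_{j=0}^{i} d^j}}.
\end{aligned}$$

The first line is the definition of $\beta_i$; the second follows from the induction hypothesis.
It follows that $\beta_{i+4} \le n/2^{d^i}$, and hence that $i^*$ is $\ln\ln n/\ln d + O(1)$. By inductively
applying Eqn. (14.4), we find that

$$\Pr(\neg\mathcal{E}_{i^*}) \le \frac{i^*}{n^2}.$$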

We now handle the case where $p_i n < 6 \ln n$. We have

$$\begin{aligned}
\Pr(\nu_{i^*+1} > 18 \ln n \mid \mathcal{E}_{i^*}) &\le \Pr(\mu_{i^*+1} > 18 \ln n \mid \mathcal{E}_{i^*}) \\
&\le \frac{\Pr(B(n, 6 \ln n/n) \ge 18 \ln n)}{\Pr(\mathcal{E}_{i^*})} \\
&\le \frac{1}{n^2 \Pr(\mathcal{E}_{i^*})},
\end{aligned}$$

where the last inequality again follows from the Chernoff bound. Removing the conditioning
as before then yields

$$\Pr(\nu_{i^*+1} > 18 \ln n) \le \Pr(\neg\mathcal{E}_{i^*}) + \frac{1}{n^2} \le \frac{i^* + 1}{n^2}. \tag{14.5}$$

To wrap up, we note that

$$\Pr(\nu_{i^*+3} \ge 1) \le \Pr(\mu_{i^*+3} \ge 1) \le \Pr(\mu_{i^*+2} \ge 2)$$

and bound the latter quantity as follows:

$$\Pr(\mu_{i^*+2} \ge 2 \mid \nu_{i^*+1} \le 18 \ln n) \le \frac{\Pr(B(n, (18 \ln n/n)^d) \ge 2)}{\Pr(\nu_{i^*+1} \le 18 \ln n)} \le \frac{\binom{n}{2} (18 \ln n/n)^{2d}}{\Pr(\nu_{i^*+1} \le 18 \ln n)}.$$

Here the last inequality comes from applying the crude union bound; there are $\binom{n}{2}$ ways
of choosing two balls, and for each pair the probability that both balls have height at
least $i^* + 2$ is $(18 \ln n/n)^{2d}$.

Removing the conditioning as before and then using Eqn. (14.5) yields

$$\begin{aligned}
\Pr(\nu_{i^*+3} \ge 1) &\le \Pr(\mu_{i^*+2} \ge 2) \\
&\le \Pr(\mu_{i^*+2} \ge 2 \mid \nu_{i^*+1} \le 18 \ln n)\Pr(\nu_{i^*+1} \le 18 \ln n) + \Pr(\nu_{i^*+1} > 18 \ln n) \\
&\le \frac{(18 \ln n)^{2d}}{n^{2d-2}} + \frac{i^* + 1}{n^2},
\end{aligned}$$

showing that $\Pr(\nu_{i^*+3} \ge 1)$ is $o(1/n)$ for $d \ge 2$ and hence that the probability the maximum
bin load is more than $i^* + 3 = \ln\ln n/\ln d + O(1)$ is $o(1/n)$. ■

Breaking ties randomly is convenient for the proof, but in practice any natural tie-breaking
scheme will suffice. For example, in Exercise 14.1 we show that if the bins
are numbered from 1 to $n$ then breaking ties in favor of the smaller-numbered bin is
sufficient.

As an interesting variation, suppose that we split the $n$ bins into two groups of equal
size. Think of half of the bins as being on the left and the other half on the right.
Each ball now chooses one bin independently and uniformly at random from each half.
Again, each ball is placed in the least loaded of the two bins, but now, if there is a
tie, the ball is placed in the bin on the left half. Surprisingly, by splitting the bins and
breaking ties in this fashion, we can obtain a slightly better bound on the maximum
load: $\ln\ln n/\left(2\ln((1 + \sqrt{5})/2)\right) + O(1)$. One can generalize this approach by splitting
the bins into $d$ ordered equal-sized groups; in case of a tie for the least-loaded bin, the
bin in the lowest-ranked group obtains the ball. This variation is the subject of
Exercise 14.13.
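
The split-and-go-left variation for two groups takes only a few lines to simulate. The sketch below is ours and assumes, for simplicity, that $n$ is even; the function name is hypothetical.

```python
import random


def max_load_go_left(n, seed=None):
    """Split n bins into a left and a right half. Each of n balls picks one
    bin uniformly from each half and goes to the less loaded one; ties go
    to the left bin. (Assumes n is even.)"""
    rng = random.Random(seed)
    half = n // 2
    load = [0] * (2 * half)
    for _ in range(2 * half):
        left = rng.randrange(half)
        right = half + rng.randrange(half)
        target = left if load[left] <= load[right] else right
        load[target] += 1
    return max(load)
```

For the values of $n$ that are practical to simulate, the difference from the unsplit two-choice process is small and often invisible in a single run; averaging the maximum load over many seeds makes any improvement easier to see.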


14.2. Two Choices: The Lower Bound 


In this section we demonstrate that the result of Theorem 14.1 is essentially tight by 
proving a corresponding lower bound. 


Theorem 14.4: Suppose that $n$ balls are sequentially placed into $n$ bins in the following
manner. For each ball, $d \ge 2$ bins are chosen independently and uniformly at
random (with replacement). Each ball is placed in the least full of the $d$ bins at the
time of the placement, with ties broken randomly. After all the balls are placed, the
maximum load of any bin is at least $\ln\ln n/\ln d - O(1)$ with probability $1 - o(1/n)$.

The proof is similar in spirit to the upper bound, but there are some key differences. As
with the upper bound, we wish to find a sequence of values $\gamma_i$ such that the number of
bins with load at least $i$ is bounded below by $\gamma_i$ with high probability. In deriving the
upper bound, we used the number of balls with height at least $i$ as an upper bound on
the number of bins with load at least $i$. We cannot do this in proving a lower bound,
however. Instead, we find a lower bound on the number of balls with height exactly $i$
and then use this as a lower bound on the number of bins with load at least $i$.

In a similar vein, for the proof of the upper bound we used that the number of bins
with at least $i$ balls at time $n$ was at least $\nu_i(t)$ for any time $t \le n$. This is not helpful
now that we are proving a lower bound; we need a lower bound on $\nu_i(t)$, not an upper
bound, to determine the probability that the $t$th ball has height $i + 1$. To cope with this,
we determine a lower bound $\gamma_i$ on the number of bins with load at least $i$ that exist at
time $n(1 - 1/2^i)$ and then bound the number of balls of height $i + 1$ that arise over the
interval $(n(1 - 1/2^i),\ n(1 - 1/2^{i+1})]$. This guarantees that appropriate lower bounds
hold when we need them in the induction, as we shall clarify in the proof.
We state the lemmas that we need, which are similar to those for the upper bound.

Lemma 14.5:

$$\Pr(B(n, p) \le np/2) \le e^{-np/8}. \tag{14.6}$$

Lemma 14.6: Let $X_1, X_2, \ldots, X_n$ be a sequence of random variables in an arbitrary
domain, and let $Y_1, Y_2, \ldots, Y_n$ be a sequence of binary random variables with the property
that $Y_i = Y_i(X_1, \ldots, X_i)$. If

$$\Pr(Y_i = 1 \mid X_1, \ldots, X_{i-1}) \ge p,$$

then

$$\Pr\left(\sum_{i=1}^{n} Y_i > k\right) \ge \Pr(B(n, p) > k).$$

Proof of Theorem 14.4: Let $\mathcal{F}_i$ be the event that $\nu_i(n(1 - 1/2^i)) \ge \gamma_i$, where $\gamma_i$ is
given by:

$$\gamma_0 = n;$$

$$\gamma_{i+1} = \frac{n}{2^{i+3}} \left(\frac{\gamma_i}{n}\right)^{d}.$$

Clearly $\mathcal{F}_0$ holds with probability 1. We now show inductively that successive $\mathcal{F}_i$ hold
with sufficiently high probability to obtain the desired lower bound.

We want to compute

$$\Pr(\neg\mathcal{F}_{i+1} \mid \mathcal{F}_i).$$

With this in mind, for $t$ in the range $R = (n(1 - 1/2^i),\ n(1 - 1/2^{i+1})]$, define the binary
random variable $Z_t$ by

$$Z_t = 1 \ \text{ if and only if } \ h(t) = i + 1 \ \text{ or } \ \nu_{i+1}(t - 1) \ge \gamma_{i+1}.$$

Hence $Z_t$ is always 1 if $\nu_{i+1}(t - 1) \ge \gamma_{i+1}$.

The probability that the $t$th ball has height exactly $i + 1$ is

$$\left(\frac{\nu_i(t - 1)}{n}\right)^{d} - \left(\frac{\nu_{i+1}(t - 1)}{n}\right)^{d}.$$

The first term is the probability that all the $d$ bins chosen by the $t$th ball have load at
least $i$. This is necessary for the $t$th ball to have height exactly $i + 1$.
However, we must subtract out the probability that all $d$ choices have at least $i + 1$
balls, because in this case the height of the ball will be larger than $i + 1$.

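Again letting $\omega_j$ represent the bins selected by the $j$th ball, we conclude that

$$\Pr(Z_t = 1 \mid \omega_1, \ldots, \omega_{t-1}, \mathcal{F}_i) \ge \left(\frac{\gamma_i}{n}\right)^{d} - \left(\frac{\gamma_{i+1}}{n}\right)^{d}.$$

This is because $Z_t$ is automatically 1 if $\nu_{i+1}(t - 1) \ge \gamma_{i+1}$; hence we can consider the
probability in the case where $\nu_{i+1}(t - 1) < \gamma_{i+1}$. Also, conditioned on $\mathcal{F}_i$, we have
$\nu_i(t - 1) \ge \gamma_i$.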
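From the definition of the $\gamma_i$ we can further conclude that

$$\Pr(Z_t = 1 \mid \omega_1, \ldots, \omega_{t-1}, \mathcal{F}_i) \ge \left(\frac{\gamma_i}{n}\right)^{d} - \left(\frac{\gamma_{i+1}}{n}\right)^{d} \ge \frac{1}{2}\left(\frac{\gamma_i}{n}\right)^{d}.$$

Let $p_i = \frac{1}{2}(\gamma_i/n)^d$.

Applying Lemma 14.6 yields

$$\Pr\left(\sum_{t \in R} Z_t < k \,\Big|\, \mathcal{F}_i\right) \le \Pr\left(B\left(\frac{n}{2^{i+1}}, p_i\right) < k\right).$$

Now our choice of $\gamma_i$ nicely satisfies

$$\gamma_{i+1} = \frac{1}{2} \cdot \frac{n}{2^{i+1}} \, p_i.$$

By the Chernoff bound,

$$\Pr\left(B\left(\frac{n}{2^{i+1}}, p_i\right) < \gamma_{i+1}\right) \le e^{-n p_i/(8 \cdot 2^{i+1})},$$

which is $o(1/n^2)$ provided that $p_i n/2^{i+1} \ge 17 \ln n$. Let $i^*$ be a lower bound on the
largest integer for which this holds. We subsequently show that $i^*$ can be chosen to be
$\ln\ln n/\ln d - O(1)$; for now let us assume that this is the case. Then, for $i \le i^*$, we
have shown that

$$\Pr\left(\sum_{t \in R} Z_t < \gamma_{i+1} \,\Big|\, \mathcal{F}_i\right) \le \Pr\left(B\left(\frac{n}{2^{i+1}}, p_i\right) < \gamma_{i+1}\right) = o\left(\frac{1}{n^2}\right).$$

Further, by definition we have that $\neg\mathcal{F}_{i+1}$ implies $\sum_{t \in R} Z_t < \gamma_{i+1}$. Hence, for $i \le i^*$,

$$\Pr(\neg\mathcal{F}_{i+1} \mid \mathcal{F}_i) \le \Pr\left(\sum_{t \in R} Z_t < \gamma_{i+1} \,\Big|\, \mathcal{F}_i\right) = o\left(\frac{1}{n^2}\right).$$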

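Therefore, for sufficiently large $n$,

$$\begin{aligned}
\Pr(\mathcal{F}_{i^*}) &\ge \Pr(\mathcal{F}_{i^*} \mid \mathcal{F}_{i^*-1}) \cdot \Pr(\mathcal{F}_{i^*-1} \mid \mathcal{F}_{i^*-2}) \cdots \Pr(\mathcal{F}_1 \mid \mathcal{F}_0) \cdot \Pr(\mathcal{F}_0) \\
&\ge (1 - 1/n^2)^{i^*} \\
&= 1 - o(1/n).
\end{aligned}$$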


All that remains is to demonstrate that $\ln\ln n/\ln d - O(1)$ is indeed an appropriate
choice for $i^*$. It suffices to show that $\gamma_i \ge 17 \ln n$ when $i$ is $\ln\ln n/\ln d - O(1)$. From
the recursion $\gamma_{i+1} = \gamma_i^d/(2^{i+3} n^{d-1})$, we find by a simple induction that

$$\gamma_i = \frac{n}{2^{\sum_{k=0}^{i-1} (i + 2 - k) d^k}}.$$

A very rough bound gives

$$\gamma_i \ge \frac{n}{2^{10 d^{i-1}}}.$$

We therefore look for the maximum $i$ such that

$$\frac{n}{2^{10 d^{i-1}}} \ge 17 \ln n.$$
For $n$ sufficiently large, we find that we can take $i$ as large as $\ln\ln n/\ln d - O(1)$ by
using the following chain of inequalities:

$$\begin{aligned}
\frac{n}{2^{10 d^{i-1}}} &\ge 17 \ln n, \\
2^{10 d^{i-1}} &\le \frac{n}{17 \ln n}, \\
10 d^{i-1} &\le \log_2 n - \log_2(17 \ln n), \\
d^{i-1} &\le \frac{1}{20} \ln n, \\
i &\le \frac{\ln\ln n}{\ln d} - O(1).
\end{aligned}$$
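
The closed form for $\gamma_i$ and the rough bound with exponent $10d^{i-1}$ can be checked mechanically. Writing $\gamma_i = n/2^{e_i}$, the recursion $\gamma_{i+1} = \gamma_i^d/(2^{i+3}n^{d-1})$ with $\gamma_0 = n$ becomes $e_{i+1} = d\,e_i + i + 3$ with $e_0 = 0$. The sketch below (function names are ours) compares this with the sum $\sum_{k=0}^{i-1}(i+2-k)d^k$ using exact integer arithmetic.

```python
def exponent_by_recursion(i, d):
    """e_i defined by e_0 = 0 and e_{j+1} = d * e_j + j + 3."""
    e = 0
    for j in range(i):
        e = d * e + j + 3
    return e


def exponent_closed_form(i, d):
    """e_i = sum over k = 0..i-1 of (i + 2 - k) * d^k."""
    return sum((i + 2 - k) * d ** k for k in range(i))


def rough_bound_holds(i, d):
    """Check e_i <= 10 * d^(i-1), which gives gamma_i >= n / 2^(10 d^(i-1))."""
    return exponent_closed_form(i, d) <= 10 * d ** (i - 1)
```

For $d = 2$ the ratio $e_i/d^{i-1}$ approaches 8 from below, so the constant 10 leaves some slack; larger $d$ gives a smaller ratio.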


14.3. Applications of the Power of Two Choices 


The balanced allocation paradigm has a number of interesting applications to computing
problems. We elaborate here on two simple applications. When considering these
applications, keep in mind that the $\ln\ln n/\ln d + O(1)$ bound we obtain for the balanced
allocation paradigm generally corresponds to a maximum load of at most 5 in practice.

14.3.1. Hashing 

When we considered hashing in Chapter 5, we related it to the balls-and-bins paradigm
by assuming that the hash function maps the items being hashed to random entries in
the hash table. Subject to this assumption, we proved that (a) when $O(n)$ items are
hashed to a table with $n$ entries, the expected number of items hashed to each individual
entry in the table is $O(1)$, and (b) with high probability, the maximum number of
items hashed to any entry in the table is $\Theta(\ln n/\ln\ln n)$.

These results are satisfactory for most applications, but for some they are not, since
the expected value of the worst-case lookup time over all items is $\Theta(\ln n/\ln\ln n)$. For
example, when storing a routing table in a router, the worst-case time for a lookup in
a hash table can be an important performance criterion, and the $\Theta(\ln n/\ln\ln n)$ result
is too large. Another potential problem is wasted memory. For example, suppose that
we design a hash table where each bin should fit in a single fixed-size cache line of
memory. Because the maximum load is so much larger than the average, we will have
to use a large number of cache lines and many of them will be completely empty. For
some applications, such as routers, this waste of memory is undesirable.

Applying the balanced allocation paradigm, we obtain a hashing scheme with O(1)
expected and O(ln ln n) maximum access time. The 2-way chaining technique uses
two random hash functions. The two hash functions define two possible entries in the
table for each item. The item is inserted into the location that is least full at the time of
insertion. Items in each entry of the table are stored in a linked list. If n items are
sequentially inserted into a table of size n, the expected insertion and lookup time is still
O(1). (See Exercise 14.3.) Theorem 14.1 implies that with high probability the
maximum time to find an item is O(ln ln n), versus the Θ(ln n / ln ln n) time when a single
random hash function is used. This improvement does not come without cost. Since a
search for an item now involves a search in two bins instead of one, the improvement in
the expected maximum search time comes at the cost of roughly doubling the average
search time. This cost can be mitigated if the two bins can be searched in parallel.
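
The 2-way chaining scheme described above can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions, not an implementation from
the text: the class name TwoWayChainingTable is hypothetical, Python's built-in hash
combined with two random salts stands in for two truly random hash functions, and
Python lists stand in for the linked lists.

```python
import random


class TwoWayChainingTable:
    """Hash table in which each key has two candidate bins (2-way chaining)."""

    def __init__(self, size, seed=None):
        rng = random.Random(seed)
        self.size = size
        self.bins = [[] for _ in range(size)]
        # Assumption of this sketch: two random salts emulate two
        # independent random hash functions.
        self.salts = (rng.getrandbits(64), rng.getrandbits(64))

    def _candidates(self, key):
        # The two possible table entries for this key.
        return [hash((salt, key)) % self.size for salt in self.salts]

    def insert(self, key):
        first, second = self._candidates(key)
        # Place the item in whichever bin is less full at insertion time.
        if len(self.bins[first]) <= len(self.bins[second]):
            self.bins[first].append(key)
        else:
            self.bins[second].append(key)

    def contains(self, key):
        # A lookup has to examine both candidate bins.
        return any(key in self.bins[b] for b in self._candidates(key))

    def max_load(self):
        return max(len(b) for b in self.bins)
```

Note that contains inspects both candidate bins, which is the source of the roughly
doubled average search time; max_load reports the longest chain in the table.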

14.3.2. Dynamic Resource Allocation 

Suppose a user or a process has to choose on-line between a number of identical
resources (choosing a server to use among servers in a network; choosing a disk to store a
directory; choosing a printer; etc.). To find the least loaded resource, users may check 
the load on all resources before placing their requests. This process is expensive, since 
it requires sending an interrupt to each of the resources. A second approach is to send 
the task to a random resource. This approach has minimal overhead, but if all users 
follow it then loads will vary significantly among servers. The balanced allocation 
paradigm suggests a more efficient solution. If each user samples the load of two
resources and sends his request to the least loaded one, then the total overhead remains
small while the load on the n resources varies much less. 
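
The difference between the two strategies can be observed with a small simulation.
The sketch below is illustrative only: the function name max_load, the seed, and the
choice of n = 100,000 requests and resources are assumptions of the example, and
ties are resolved in favor of the first resource sampled (which, because the samples
arrive in random order, acts as a random tie-break).

```python
import random


def max_load(n, d, rng):
    """Send n requests to n resources; each request samples d resources
    uniformly at random and joins the least loaded of them."""
    load = [0] * n
    for _ in range(n):
        choices = [rng.randrange(n) for _ in range(d)]
        # min() returns the first least-loaded resource among the samples.
        best = min(choices, key=load.__getitem__)
        load[best] += 1
    return max(load)


rng = random.Random(2024)
n = 100_000
print("one random choice:", max_load(n, 1, rng))
print("best of two      :", max_load(n, 2, rng))
```

Each request inspects only d loads, so the overhead per request stays constant while
the heaviest resource carries noticeably less load when d = 2.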


14.4. Exercises 

Exercise 14.1: (a) For Theorems 14.1 and 14.4, the statement of the proof is for the
case that ties are broken randomly. Argue informally that, if the bins are numbered
from 1 to n and if ties are broken in favor of the lower-numbered bin, then the theorems
still hold.

(b) Argue informally that the theorems apply to any tie-breaking mechanism that
has no knowledge of the bin choices made by balls that have not yet been placed.


Exercise 14.2: Consider the following variant of the balanced allocation paradigm: 
n balls are placed sequentially in n bins, with the bins labeled from 0 to n - 1. Each
ball chooses a bin i uniformly at random, and the ball is placed in the least loaded of
bins i, i + 1 mod n, i + 2 mod n, ..., i + d - 1 mod n. Argue that, when d is a constant,
the maximum load grows as Θ(ln n / ln ln n). That is, the balanced allocation paradigm
does not yield an O(ln ln n) result in this case.

Exercise 14.3: Explain why, with 2-way chaining, the expected time to insert an item
and to search for an item in a hash table of size n with n items is O(1). Consider two
cases: the search is for an item that is in the table; and the search is for an item that is 
not in the table. 

Exercise 14.4: Consider the following variant of the balanced allocation paradigm: 
n balls are placed sequentially in n bins. Each ball comes with d choices, chosen 
independently and uniformly at random from the n bins. When a ball is placed, we 
are also allowed to move balls among these d bins to equalize their load as much as
possible. Show that the maximum load is still at least ln ln n / ln d - O(1) with
probability 1 - o(1/n) in this case.

Exercise 14.5: Suppose that in the balanced allocation setup there are n bins, but the
bins are not chosen uniformly at random. Instead, the bins have two types: 1/3 of the
bins are type A and 2/3 of the bins are type B. When a bin is chosen at random, each
of the type-A bins is chosen with probability 2/n and each of the type-B bins is chosen
with probability 1/(2n). Prove that the maximum load of any bin when each ball has d
bin choices is still ln ln n / ln d + O(1).
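
As a quick sanity check on the setup of this exercise (assuming n is divisible by 3),
the stated probabilities do form a valid distribution over the n bins:

```latex
\[
\underbrace{\frac{n}{3}\cdot\frac{2}{n}}_{\text{type A}}
\;+\;
\underbrace{\frac{2n}{3}\cdot\frac{1}{2n}}_{\text{type B}}
\;=\; \frac{2}{3} + \frac{1}{3} \;=\; 1 .
\]
```

In particular, each type-A bin is four times as likely to be chosen as each type-B bin.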


Exercise 14.6: Consider a parallel version of the balanced allocation paradigm in 
which we have n/k rounds, where k new balls arrive in each round. Each ball is placed
in the least loaded of its d choices, where in this setting the load of each bin is the load 
at the end of the previous round. Ties are broken randomly. Note that the k new balls 
cannot affect each other’s placement. Give an upper bound on the maximum load as a 
function of n, d, and k. 

Exercise 14.7: We have shown that sequentially throwing n balls into n bins randomly, 
using two bin choices for each ball, yields a maximum load of ln ln n / ln 2 + O(1) with
high probability. Suppose that, instead of placing the balls sequentially, we had
access to all of the 2n choices for the n balls, and suppose we wanted to place each ball
into one of its choices while minimizing the maximum load. In this setting, with high 
probability, we can obtain a maximum load that is constant. 

Write a program to explore this scenario. Your program should take as input a
parameter k and implement the following greedy algorithm. At each step, some subset of the
balls are active; initially, all balls are active. Repeatedly find a bin that has at least one 
but no more than k active balls that have chosen it, assign these active balls to that bin, 
and then remove these balls from the set of active balls. The process stops either when 
there are no active balls remaining or when there is no suitable bin. If the algorithm 
stops with no active balls remaining, then every bin is assigned no more than k balls. 

Try running your program with 10,000 balls and 10,000 bins. What is the smallest 
value of k for which the program terminates with no active balls remaining at least four 
out of five times? If your program is fast enough, try experimenting with more trials. 
Also, if your program is fast enough, try answering the same question for 100,000 balls 
and 100,000 bins. 
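
One possible starting point for the greedy procedure described in this exercise is
sketched below. The function name greedy_assign, its return value (the number of
balls still active when the process stops), and the queue-based bookkeeping are
choices of this sketch rather than part of the exercise statement. Each ball is assumed
to draw its two choices independently and uniformly, so the two may occasionally
coincide.

```python
import random
from collections import deque


def greedy_assign(n, k, rng):
    """Greedy procedure with n balls, n bins, and two choices per ball.
    Returns the number of balls left active when no suitable bin remains."""
    choices = [(rng.randrange(n), rng.randrange(n)) for _ in range(n)]
    active_in_bin = [set() for _ in range(n)]
    for ball, (a, b) in enumerate(choices):
        active_in_bin[a].add(ball)
        active_in_bin[b].add(ball)

    # Bins that currently hold between 1 and k active balls.
    queue = deque(b for b in range(n) if 1 <= len(active_in_bin[b]) <= k)
    remaining = n
    while queue:
        b = queue.popleft()
        balls = active_in_bin[b]
        if not 1 <= len(balls) <= k:
            continue  # stale queue entry: the bin was emptied in the meantime
        for ball in list(balls):
            # Assign the ball to bin b and deactivate it in its other choice.
            for other in choices[ball]:
                if other != b:
                    active_in_bin[other].discard(ball)
                    if 1 <= len(active_in_bin[other]) <= k:
                        queue.append(other)
            remaining -= 1
        balls.clear()
    return remaining
```

Calling greedy_assign(10_000, k, random.Random(seed)) for several seeds and for
increasing k gives the raw data the exercise asks about; a return value of 0 means
every bin was assigned at most k balls.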

Exercise 14.8: The following problem models a simple distributed system where 
agents contend for resources and back off in the face of contention. As in Exercise 5.11,
balls represent agents and bins represent resources. 

The system evolves over rounds. In the first part of every round, balls are thrown 
independently and uniformly at random into n bins. In the second part of each round, 
each bin in which at least one ball has landed in that round serves exactly one ball from 
that round. The remaining balls are thrown again in the next round. We begin with n 
balls in the first round, and we finish when every ball is served. 


(a) Show that, with probability 1 - o(1/n), this approach takes at most log_2 log_2 n +
O(1) rounds. (Hint: Let b_k be the number of balls left after k rounds; show that
b_{k+1} ≤ c(b_k)^2/n, for a suitable constant c with high probability, as long as b_{k+1} is
sufficiently large.)

(b) Suppose that we modify the system so that a bin accepts a ball in a round if and
only if that ball was the only ball to request that bin in that round. Show that, again
with probability 1 - o(1/n), this approach takes at most log_2 log_2 n + O(1) rounds.
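
The process in this exercise is easy to simulate, which can help build intuition before
attempting the proofs. The sketch below is illustrative only; the function name
rounds_until_served and the unique_only flag (False for the system of part (a), True
for the modified system of part (b)) are assumptions of the example. It assumes
n is at least 2 when unique_only is set, since otherwise colliding balls could never
be served.

```python
import random


def rounds_until_served(n, rng, unique_only=False):
    """Simulate the back-off process and return the number of rounds used.

    unique_only=False: every non-empty bin serves exactly one ball per round.
    unique_only=True : a bin serves a ball only if it received exactly one
                       ball in that round.
    """
    balls = n
    rounds = 0
    while balls > 0:
        counts = {}
        for _ in range(balls):
            b = rng.randrange(n)
            counts[b] = counts.get(b, 0) + 1
        if unique_only:
            served = sum(1 for c in counts.values() if c == 1)
        else:
            served = len(counts)  # one ball served per non-empty bin
        balls -= served
        rounds += 1
    return rounds
```

Running it for growing n and comparing the result with log_2 log_2 n shows how slowly
the number of rounds increases.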

Exercise 14.9: The natural way to simulate experiments with balls and bins is to create
an array that stores the load at each bin. To simulate 1,000,000 balls being placed
into 1,000,000 bins would require an array of 1,000,000 counters. An alternative
approach is to keep an array that records in the jth cell the number of bins with load j.
Explain how this could be used to simulate placing 1,000,000 balls into 1,000,000 bins 
using the standard balls-and-bins paradigm and the balanced allocation paradigm with 
much less space. 


Exercise 14.10: Write a program to compare the performance of the standard balls- 
and-bins paradigm and the balanced allocation paradigm. Run simulations placing n 
balls into n bins, with each ball having d = 1, 2, 3, and 4 random choices. You should
try n = 10,000, n = 100,000, and n = 1,000,000. Repeat each experiment at least 100
times and compute the expectation and variance of the maximum load for each value 
of d based on your trials. You may wish to use the idea of Exercise 14.9. 


Exercise 14.11: Write a simulation showing how the balanced allocation paradigm 
can improve performance for distributed queueing systems. Consider a bank of n FIFO 
queues with a Poisson arrival stream of customers to the entire bank of rate λn per
second, where λ < 1. Upon entry a customer chooses a queue for service, and the
service time for each customer is exponentially distributed with mean 1 second. You
should compare two settings: (i) where each customer chooses a queue independently
and uniformly at random from the n queues for service; and (ii) where each customer
chooses two queues independently and uniformly at random from the n queues and
waits at the queue with fewer customers, breaking ties randomly. Notice that the first
setting is equivalent to having a bank of n M/M/1 FIFO queues, each with Poisson
arrivals of rate λ < 1 per second. You may find the discussion in Exercise 8.26 helpful
in constructing your simulation. 

Your simulation should run for t seconds, and it should return the average (over all 
customers that have completed service) of the time spent in the system as well as the
average (over all customers that have arrived) of the number of customers found waiting
in the queue they selected for service. You should present results for your simulations
for n = 100 and for t = 10,000 seconds, with λ = 0.5, 0.8, 0.9, and 0.99.

Exercise 14.12: Write a program to compare the performance of the following
variation of the standard balls-and-bins paradigm and the balanced allocation paradigm.
Initially n points are placed uniformly at random on the boundary of a circle of
circumference 1. These n points divide the circle into n arcs, which correspond to bins. We
now place n balls into the bins as follows: each ball chooses d points on the boundary
of the circle, uniformly at random. These d points correspond to the arcs (or,
equivalently, bins) that they lie on. The ball is placed in the least loaded of the d bins, breaking
ties in favor of the smallest arc.

Run simulations placing n balls into n bins for the cases d = 1 and d = 2. You
should try n = 1,000, n = 10,000, and n = 100,000. Repeat each experiment at least 
100 times; for each run, the n initial points should be re-chosen. Give a chart showing 
the number of times the maximum load was k, based on your trials for each value of d. 

You may note that some arcs are much larger than others, and therefore when d = 
1 the maximum load can be rather high. Also, to find which bin each ball is placed in 
may require implementing a binary search or some other additional data structure to 
quickly map points on the circle boundary to the appropriate bin. 
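
The binary-search lookup mentioned above might be set up as follows. This is only a
sketch of that one building block: the helper names make_arcs, arc_index, and
arc_length are hypothetical, arc j is taken to run from the jth sorted point to the
next point (wrapping around at the end of the list), and n is assumed to be at least 2.

```python
import bisect
import random


def make_arcs(n, rng):
    """Place n points uniformly on a circle of circumference 1, sorted."""
    return sorted(rng.random() for _ in range(n))


def arc_index(points, x):
    """Index of the arc containing position x, found by binary search.
    A position before the first point belongs to the wrap-around arc."""
    return (bisect.bisect_right(points, x) - 1) % len(points)


def arc_length(points, j):
    """Length of arc j, needed to break ties in favor of the smallest arc."""
    nxt = points[(j + 1) % len(points)]
    return (nxt - points[j]) % 1.0


rng = random.Random(7)
points = make_arcs(1000, rng)
bin_of_ball = arc_index(points, rng.random())
```

Each lookup costs O(log n) time, so placing n balls with d choices each takes
O(dn log n) time after the initial sort.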


Exercise 14.13: There is a small but interesting improvement that can be made to the 
balanced allocation scheme we have described. Again we will place n balls into n bins. 
We assume here that n is even. Suppose that we divide the n bins into two groups of
size n/2. We call the two groups the left group and the right group. For each ball, we 
independently choose one bin uniformly at random from the left and one bin uniformly 
at random from the right. We put the ball in the least loaded bin, but if there is a tie we 
always put the ball in the bin from the left group. With this scheme, the maximum load 
is reduced to ln ln n / (2 ln φ) + O(1), where φ = (1 + √5)/2 is the golden ratio. This
improves the result of Theorem 14.1 by a constant factor. (Note the two changes to our
original scheme: the bins are split into two groups, and ties are broken in a consistent 
way; both changes are necessary to obtain the improvement we describe.) 
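
The placement rule of this variation can be sketched as follows. The function name
go_left_max_load and the convention that bins 0 to n/2 - 1 form the left group and
bins n/2 to n - 1 form the right group are assumptions of this illustration.

```python
import random


def go_left_max_load(n, rng):
    """Place n balls into n bins (n even): one choice from the left half,
    one from the right half, with ties always going to the left bin."""
    half = n // 2
    load = [0] * n
    for _ in range(n):
        left = rng.randrange(half)           # uniform over the left group
        right = half + rng.randrange(half)   # uniform over the right group
        # "<=" implements the consistent tie-break in favor of the left.
        target = left if load[left] <= load[right] else right
        load[target] += 1
    return max(load)
```

Replacing "<=" with a random tie-break, or drawing both choices from all n bins,
removes one of the two changes that the scheme relies on.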


(a) Write a program to compare the performance of the original balanced allocation 
paradigm with this variation. Run simulations placing n balls into n bins, with 
each ball having d = 2 choices. You should try n = 10,000, n = 100,000, and 
n = 1,000,000. Repeat each experiment at least 100 times and compute the
expectation and variance of the maximum load based on your trials. Describe the extent
of the improvement of the new variation. 

(b) Adapt Theorem 14.1 to prove this result. The key idea in how the theorem’s proof
must change is that we now require two sequences, $\beta_i$ and $\gamma_i$. Similar to
Theorem 14.1, $\beta_i$ represents a desired upper bound on the number of bins on the left
with load at least i, and $\gamma_i$ is a desired upper bound on the number of bins on the
right with load at least i. Argue that choosing

\[
\beta_{i+1} = \frac{c_1 \beta_i \gamma_i}{n^2}
\quad\text{and}\quad
\gamma_{i+1} = \frac{c_2 \beta_{i+1} \gamma_i}{n^2}
\]

for some constants $c_1$ and $c_2$ is suitable (as long as $\beta_i$ and $\gamma_i$ are large enough that
Chernoff bounds may apply).

Now let $F_k$ be the kth Fibonacci number. Apply induction to show that, for
sufficiently large i, $\beta_i \le n c_3 c_4^{F_{2i}}$ and $\gamma_i \le n c_3 c_4^{F_{2i+1}}$ for some constants $c_3$ and $c_4$.
Following Theorem 14.1, use this to prove the ln ln n / (2 ln φ) + O(1) upper bound.

(c) This variation can easily be extended to the case of d > 2 choices by splitting the 
n bins into d ordered groups, choosing one bin uniformly at random from each 
group, and breaking ties in favor of the group that comes first in the ordering.
Suggest what would be the appropriate upper bound on the maximum load for this
case, and give an argument backing your suggestion. (You need not give a
complete formal proof.)

“Probability is part of the conceptual core of modern computer science.
Probabilistic analysis of algorithms, randomized algorithms, and probabilistic 
combinatorial constructions have become fundamental tools for computer science 
and applied mathematics. This book provides a thorough grounding in discrete 
probability and its applications in computing, at a level accessible to advanced 
undergraduates in the computational, mathematical, and engineering sciences.” 

- Richard M. Karp, University Professor, University of California at Berkeley

“An exciting new book on randomized algorithms. It nicely covers all the basics, 
and also has some interesting modern applications for the more advanced 
student.” 

Randomization and probabilistic techniques play an important role
in modern computer science, with applications ranging from combinatorial
optimization and machine learning to communication networks and secure protocols.

This textbook is designed to accompany a one- or two-semester course for 
advanced undergraduates or beginning graduate students in computer science and 
applied mathematics. It gives an excellent introduction to the probabilistic
techniques and paradigms used in the development of probabilistic algorithms and
analyses. It assumes only an elementary background in discrete mathematics and 
gives a rigorous yet accessible treatment of the material, with numerous examples 
and applications. 

The first half of the book covers core material, including random sampling, 
expectations, Markov’s inequality, Chebyshev’s inequality, Chernoff bounds, balls-
and-bins models, the probabilistic method, and Markov chains. In the second half,
the authors delve into more advanced topics such as continuous probability,
applications of limited independence, entropy, Markov chain Monte Carlo methods,